1. INTRODUCTION

Web Data Extraction systems are a broad class of software applications aimed
at extracting information from Web sources such as Web pages [Laender et al. 2002;
Baumgartner et al. 2009]. A Web Data Extraction system usually interacts with a
Web source and extracts the data stored in it: for instance, if the source is an HTML
Web page, the extracted information could consist of elements of the page as well as
the full text of the page itself. The extracted data might then be post-processed,
converted into the most convenient structured format and stored for further use
[Zhao 2007; Irmak and Suel 2006].

Web Data Extraction systems find extensive use in a wide range of applications,
such as the analysis of the text documents at the disposal of a firm (e-mails, support
forums, technical and legal documentation, and so on), Business and Competitive
Intelligence [Baumgartner et al. 2005], crawling of Social Web platforms [Catanese
et al. 2011; Gjoka et al. 2010], Bio-Informatics [Plake et al. 2006] and so on. The
importance of Web Data Extraction systems stems from the fact that, today, a
large (and quickly growing) amount of information is continuously produced, shared
and consumed online: Web Data Extraction systems make it possible to collect
this information efficiently and with limited human effort. The availability and
analysis of the collected data is an indispensable requirement for understanding the
complex social, scientific and economic phenomena that generated the information
itself. For instance, collecting the digital traces produced by users of Social Web
platforms like Facebook, YouTube or Flickr is the key step to verify sociological
theories on a large scale [Kleinberg 2000; Backstrom et al. 2011] or to check whether
mathematical models developed in the field of Complex Networks [Newman 2003]
are able to correctly explain human behaviors.

In the commercial field, the Web provides a wealth of public domain information
which can be retrieved, for instance, from the Web page of a firm or from an
e-commerce platform. A firm can probe the Web to acquire and analyze information
about the activities of its competitors; such a process is generally known as
Competitive Intelligence [Chen et al. 2002; Zanasi 1998] and it is a success factor for
the firm, which is thereby able to quickly identify the opportunities provided by the
market, to anticipate the activities of its competitors and to learn from their
faults and successes.

1.1 Challenges of Web Data Extraction techniques 

The design and implementation of Web Data Extraction systems has been
considered from different perspectives, and it draws on scientific results and tools
coming from various disciplines like Machine Learning, Logic and Natural Language
Processing.

In the design of a Web Data Extraction system, many factors must be taken into
account; some of these factors are independent of the specific application domain
in which we plan to perform Web Data Extraction. Other factors, instead, heavily
depend on the particular features of the application domain: as a consequence, some
technological solutions which appear to be effective in some application contexts
are not suitable in others. To clarify this concept, observe that some
approaches focus on static HTML Web pages and use the tags composing a page,
along with their hierarchical organization, to extract information. These approaches
are able to achieve a high level of accuracy as well as to scale efficiently over large
document collections but, unfortunately, the results they obtain are no longer
replicable if other kinds of Web sources are considered (e.g., Web pages dynamically
generated after filling in a Web form).

This reasoning tells us that, in its most general formulation, the problem of
extracting data from the Web is hard because it is constrained by several
requirements. The key challenges we can encounter in the design of a Web Data
Extraction system can be summarized as follows:

— Web Data Extraction techniques often require the help of human experts. A first
challenge consists of providing a high degree of automation by reducing human
effort as much as possible. Human feedback, however, may play an important
role in raising the level of accuracy achieved by a Web Data Extraction system.
A related challenge is, therefore, to identify a reasonable trade-off between the
need for highly automated Web Data Extraction procedures and the
requirement of achieving highly accurate performance.

— Web Data Extraction techniques should be able to process large volumes of data
in a relatively short time. This need is particularly urgent in the field of Business
and Competitive Intelligence, because a firm needs to perform timely analysis of
market conditions.

— Applications in the field of the Social Web or, more generally, applications dealing
with human-related data must provide solid privacy guarantees. Therefore,
potential (even if unintentional) attempts to violate user privacy should be
identified and reported in a timely and adequate manner.

— Approaches relying on Machine Learning often require a significantly large
training set of manually labeled Web pages. In general, the task of labeling pages is
time-consuming and error-prone and, therefore, in many cases we cannot assume
the existence of labeled pages.

— In some cases, a Web Data Extraction tool has to routinely extract data from
a Web Data source which can evolve over time. Web sources are continuously
evolving and structural changes happen with no forewarning, and are thus
unpredictable. In real-world scenarios, therefore, these systems need to be
maintained, since they might stop working correctly if they lack the flexibility to
detect and cope with structural modifications of the related Web sources.

1.2 Related works 

The theme of Web Data Extraction is covered by a number of reviews. Laender
et al. [Laender et al. 2002] presented a survey that offers a rigorous taxonomy
to classify Web Data Extraction systems. They introduced a set of criteria and a
qualitative analysis of various Web Data Extraction tools.

Kushmerick [Kushmerick 2002] outlined a profile of finite-state approaches to the
Web Data Extraction problem. The author analyzed both wrapper induction
approaches (i.e., approaches capable of automatically generating wrappers by
exploiting suitable examples) and maintenance ones (i.e., the update of a wrapper each
time the structure of the Web source changes). In [Kushmerick 2002], Web Data
Extraction techniques derived from Natural Language Processing and Hidden Markov
Models were also discussed.

On the wrapper induction problem, Flesca et al. [Flesca et al. 2004] and Kaiser 
and Miksch [Kaiser and Miksch 2005] surveyed approaches, techniques and tools. 
The latter paper, in particular, provided a model describing the architecture of an 
Information Extraction system. 

Chang et al. [Chang et al. 2006] introduced a three-dimensional categorization
of Web Data Extraction systems, based on task difficulties, techniques used and
degree of automation. Fiumara [Fiumara 2007] applied these criteria to classify four
of the state-of-the-art Web Data Extraction systems as of 2007. A relevant
work on Information Extraction is due to Sarawagi [Sarawagi 2008] and, in our
opinion, anybody who intends to approach this discipline should read it. To the
best of our knowledge, the work of Baumgartner et al. [Baumgartner et al.
2009] is the most up-to-date survey on the state of the art of the discipline as of
this date.

1.3 Our contribution 

The goal of this survey is to provide a structured and comprehensive overview of
the research efforts made in the field of Web Data Extraction, as well as an
overview of the most recent results in the literature.

In addition, we try to adopt a point of view different from that used
in other surveys on this discipline: most papers, in fact, presented a list of tools,
reporting a feature-based classification or an experimental comparison of these tools.
Many of these papers are solid starting points in the study of this area. Unlike the
existing surveys, our ambition is to provide a classification of existing Web Data
Extraction techniques in terms of the application domains in which they have been
employed. We want to shed light on the various research directions taken in this
field as well as to understand to what extent techniques initially applied in a
particular application domain have later been re-used in others. To the best of our
knowledge, this is the first survey that analyzes Web Data Extraction systems in
depth from the perspective of their application fields.

Nevertheless, we also provide a detailed discussion of techniques to perform Web
Data Extraction. We identify two main categories, i.e., approaches based on Tree
Matching algorithms and approaches based on Machine Learning algorithms. For
each category, we first describe the basic techniques employed in that category and
then illustrate their variants. We also show how each category addresses
the problems of automatic wrapper generation and maintenance. After that, we
focus on applications that are strictly interconnected with Web Data Extraction
tasks. We want to cover in particular enterprise, social and scientific applications
and discover which fields have already been approached (e.g., advertising
engineering, enterprise solutions, Business and Competitive Intelligence, etc.) and which
are likely to be approached in the immediate future (e.g., Bio-informatics, Web
Harvesting, etc.).

We also discuss the potential of cross-fertilization, i.e., whether strategies
employed in a given domain can be re-used in others or, vice versa, whether some
applications can be adopted only in particular application domains.

1.4 Organization of the survey

This survey is organized into two main parts. The first part is devoted to providing
some general definitions which are helpful to understand the material proposed
in the survey. To this purpose, Section 2 illustrates the techniques exploited for
collecting data from Web sources and the algorithms that underlie most Web Data
Extraction systems. The main features of existing Web Data Extraction systems
are discussed at length in Section 3.

The second part of this work is about the applications of Web Data Extraction
systems to real-world scenarios. In detail, in Section 4 we identify two main
domains in which Web Data Extraction techniques have been employed, i.e.,
applications at the enterprise level and at the Social Web level. Applications at the
enterprise level are described in Section 4.1, whereas applications at the Social Web
level are covered in Section 4.2. This part concludes by discussing the opportunities
of cross-fertilization among different application scenarios (see Section 4.3).

Finally, in Section 5 we draw our conclusions and discuss novel applications of 
Web Data Extraction techniques that might arise in the future. 

2. TECHNIQUES 

The first part of this survey is dedicated to the discussion of the techniques adopted
in the field of Web Data Extraction. The first attempts to extract data from the
Web date back to the early nineties, as reported by [Kaiser and Miksch 2005;
Sarawagi 2008].

In the early stage, this discipline borrowed approaches and techniques from the
Information Extraction (IE) literature. In particular, two classes of strategies emerged
[Kaiser and Miksch 2005]: learning techniques and knowledge engineering techniques,
also called learning-based and rule-based approaches, respectively [Sarawagi
2008]. These classes share a common rationale: the former was conceived to develop
systems that require human expertise to define rules (for example, regular
expressions or program snippets) to successfully accomplish the data extraction. These
approaches require specific domain expertise: users who design and implement the
rules and train the system must have programming experience and a good
knowledge of the domain in which the data extraction system will operate and, possibly,
the ability to envisage potential usage scenarios and tasks assigned to the system.

On the other hand, some approaches of the latter class also require strong
familiarity with both the requirements and the functions of the platform, so human
engagement is essential.

To reduce the commitment of human domain experts, a number of strategies
have been devised. Some of them have been developed in the context of the Artificial
Intelligence literature, by adopting specific algorithms that use the
structure of Web pages to identify and extract data. Others are borrowed
from the Machine Learning discipline, in the form of supervised or semi-supervised
learning techniques which allow the design of systems capable of being trained by
examples and then of autonomously extracting data from similar (or even different)
domains.

In the following we will discuss these two research lines separately. In detail,
Section 2.1 presents those strategies based on the definition of algorithms capable
of identifying information by exploiting the semi-structured nature of Web pages. In
Section 2.2 we introduce the concept of Web wrappers, explain how the
techniques discussed in Section 2.1 are incorporated into Web wrappers, and discuss
techniques for their generation and maintenance. Section 2.3, instead, focuses
on Machine Learning strategies that rely on the possibility of learning from
labeled examples to extract information from previously unseen Web pages.

2.1 Tree-based techniques 

One of the most exploited features in Web Data Extraction is the semi-structured
nature of Web pages. These can be naturally represented as labeled ordered rooted
trees, where labels represent the tags of the HTML mark-up language
syntax, and the tree hierarchy represents the different levels of nesting of the elements
constituting the Web page. The representation of a Web page as a labeled
ordered rooted tree is usually referred to as the DOM (Document Object Model), whose
detailed explanation is outside the scope of this survey but which has been largely
regulated by the World Wide Web Consortium (see footnote 1). The general idea
behind the Document Object Model is that HTML Web pages are represented by
means of plain text, which contains free text as well as HTML tags, i.e., particular
keywords defined in the mark-up language that can be interpreted by the browser to
represent the elements specific to a Web page (e.g., hyper-links, buttons, images and
so forth). HTML tags may be nested one into another, forming a hierarchical
structure. This hierarchy is captured in the DOM by the document tree, whose nodes
represent HTML tags. The document tree (henceforth also referred to as the DOM tree)
has been successfully exploited for Web Data Extraction purposes in a number of
techniques discussed in the following.

2.1.1 Addressing elements in the document tree: XPath. One of the main
advantages of the adoption of the Document Object Model for the HTML language is
the possibility of exploiting some tools typical of XML languages (well-formed HTML,
as in XHTML, can to all effects be treated as a dialect of XML). In particular, the
XML Path Language (or, briefly, XPath) provides a powerful syntax to address
specific elements of an XML document (and, to the same extent, of HTML Web pages)
in a simple manner. XPath has been defined by the World Wide Web Consortium
(see footnote 2), as has the DOM.

Although describing the syntax of XPath is not the core topic of this section,
we provide Figure 1 as an example to explain how XPath can be used to address
elements of a Web page. There are two possible ways to use XPath: (i) to identify
a single element in the document tree, or (ii) to address multiple occurrences of the
same element. In the former case, illustrated in Figure 1(A), the defined XPath
identifies just a single element on the Web page (namely, a table cell); in the latter,
shown in Figure 1(B), the XPath identifies multiple instances of the same type of
element (still a table cell) sharing the same hierarchical location.
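
The two usages can be tried out with a short sketch. It assumes a small, well-formed, hypothetical page and uses Python's standard `xml.etree.ElementTree` module, which supports only a subset of XPath; in particular, paths are written relative to the root `html` element, so the leading `/html[1]` step of the expressions in Figure 1 corresponds to the root itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <head><title>Prices</title></head>
  <body>
    <table>
      <tr><td>Item</td><td>Price</td><td>Stock</td></tr>
      <tr><td>Pen</td><td>1.20</td><td>14</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())  # root is the <html> element

# (A) A single element, analogous to /html[1]/body[1]/table[1]/tr[1]/td[1]
first_cell = root.find("body[1]/table[1]/tr[1]/td[1]")

# (B) Multiple elements, analogous to /html[1]/body[1]/table[1]/tr[2]/td
second_row = root.findall("body[1]/table[1]/tr[2]/td")

print(first_cell.text)                     # Item
print([cell.text for cell in second_row])  # ['Pen', '1.20', '14']
```

Real-world HTML is frequently not well-formed, so a tolerant HTML parser would normally be placed in front of such a query; that step is omitted here.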

For the purposes of Web Data Extraction, the possibility of exploiting such a
powerful tool has been of utmost importance, and XPath has been widely adopted in
the literature as the tool to address elements in Web pages. The



1 For further details see: http://www.w3.org/DOM/
2 For the specifications see: http://www.w3.org/TR/xpath/

[Figure 1: document tree (html, with children head and body; body contains a table whose rows contain td cells)]
(A) /html[1]/body[1]/table[1]/tr[1]/td[1]
(B) /html[1]/body[1]/table[1]/tr[2]/td

Fig. 1. Example of XPath(s) on the document tree, selecting one (A) or multiple (B) items.



major weakness of XPath is its lack of flexibility: each XPath expression
is strictly tied to the structure of the Web page on which it has been
defined (this limitation has been partially mitigated by the introduction of relative
path expressions, instead of only absolute ones, in later releases, e.g., XPath 2.0;
see footnote 3). This implies that even minor changes to the structure of a Web page
might break an XPath expression defined on a previous version
of the page. This problem leads us to introduce some techniques,
usually referred to as tree matching strategies, that assess and measure
the similarity among different documents so as to identify structural variations in
their DOM trees. These techniques, discussed in the following, have been exploited
in the literature to handle situations in which flexibility in the
identification of elements within Web pages is required to make a Web Data
Extraction platform robust and reliable.

2.1.2 Tree edit distance matching algorithms. The first technique we describe is
called tree edit distance matching. The problem of computing the tree edit distance
between trees is a variation on the theme of the classic string edit distance
problem. Given two labeled ordered rooted trees A and B, the problem is to find
a matching that transforms A into B (or vice versa) with the minimum number of
operations. The set of possible operations consists of node deletion, insertion and
replacement. A cost may be assigned to each operation, in which case the task
turns into a cost-minimization problem (i.e., finding the sequence of operations of
minimum cost that transforms A into B).
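
For readers unfamiliar with the string problem that the tree formulation generalizes, the following is a minimal sketch of the classic dynamic-programming solution for string edit distance with unit costs. It is given only as background: it operates on strings, not trees, and the function name is our own.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of unit-cost insertions, deletions and replacements
    needed to turn string a into string b."""
    m, n = len(a), len(b)
    # dist[i][j] holds the distance between the prefixes a[:i] and b[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete the i characters of a[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert the j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + replace_cost,  # replacement or match
            )
    return dist[m][n]


print(edit_distance("kitten", "sitting"))  # 3
```

The tree version replaces characters with nodes and must additionally respect sibling order and ancestor relations, which is what makes it considerably more expensive.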

The reasoning above is formally encoded in the definition of mapping, borrowed
from [Liu 2011]. A mapping $M$ between two trees $A$ and $B$ is defined as a set of
ordered pairs $(i, j)$, one from each tree, satisfying the following conditions
$\forall (i_1, j_1), (i_2, j_2) \in M$:

(1) $i_1 = i_2$ if and only if $j_1 = j_2$

(2) $A[i_1]$ is on the left of $A[i_2]$ if and only if $B[j_1]$ is on the left of $B[j_2]$

(3) $A[i_1]$ is an ancestor of $A[i_2]$ if and only if $B[j_1]$ is an ancestor of $B[j_2]$

where with the notation $A[i_x]$ we indicate the $x$-th node of the tree $A$ in a
pre-order visit of the tree.



3 http://www.w3.org/TR/xpath20/

A number of intuitive consequences emerge from this definition of mapping:

— Each node must appear no more than once in a mapping; 

— The order among siblings is preserved; 

— The hierarchical relations among nodes are unchanged.

A number of techniques to approach this problem have been proposed [Wang et al.
1998; Chen 2001], all supporting the three types of operations on nodes
(i.e., node deletion, insertion and replacement) but plagued by high computational
costs. In addition, in [Zhang et al. 1992] it has been proved that the formulation
for unordered trees is NP-complete.

The simple tree matching algorithm. A computationally efficient solution to the
problem of tree edit distance matching is provided by the algorithm called simple
tree matching [Selkow 1977], and its variants. This optimized strategy comes at a
cost: node replacement is not an allowed operation during the matching procedure
(the shortcomings of this aspect are discussed further below).

The pseudo-code of the simple tree matching algorithm is provided in Algorithm
1, which adopts the following notation: d(n) represents the degree of a node n (i.e.,
the number of first-level children); T(i) is the i-th subtree of the tree rooted at node
T.



Algorithm 1 SimpleTreeMatching(T', T'')

1: if T' has the same label as T'' then
2:    m ← d(T')
3:    n ← d(T'')
4:    for i = 0 to m do
5:       M[i][0] ← 0;
6:    end for
7:    for j = 0 to n do
8:       M[0][j] ← 0;
9:    end for
10:   for all i such that 1 ≤ i ≤ m do
11:      for all j such that 1 ≤ j ≤ n do
12:         M[i][j] ← Max(M[i][j-1], M[i-1][j], M[i-1][j-1] + W[i][j]) where
            W[i][j] = SimpleTreeMatching(T'(i-1), T''(j-1))
13:      end for
14:   end for
15:   return M[m][n] + 1
16: else
17:   return 0
18: end if



The computational cost of the simple tree matching is O(nodes(A) · nodes(B)),
where nodes(T) is the function that returns the number of nodes in a tree (or a
sub-tree) T; the low cost ensures excellent performance when applied to HTML
trees, which might be complex and rich in nodes.
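
A minimal Python sketch of the recursion in Algorithm 1 follows. The tree representation, a `(label, children)` pair, and the example trees are assumptions made for illustration; they are not taken from the cited works.

```python
from typing import List, Tuple

# Assumed representation: a tree is a (label, list_of_child_trees) pair.
Tree = Tuple[str, List["Tree"]]


def simple_tree_matching(a: Tree, b: Tree) -> int:
    """Size of the maximum matching between two labeled ordered rooted trees."""
    if a[0] != b[0]:          # roots with different labels do not match
        return 0
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    # M[i][j]: best matching between the first i subtrees of a
    # and the first j subtrees of b (row 0 and column 0 stay at 0).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(kids_a[i - 1], kids_b[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1        # +1 accounts for the matched roots


# Hypothetical trees: B is A without <head> and without one <td>.
A: Tree = ("html", [("head", []),
                    ("body", [("table", [("tr", [("td", []), ("td", [])])])])])
B: Tree = ("html", [("body", [("table", [("tr", [("td", [])])])])])

print(simple_tree_matching(A, A))  # 7, every node of A matches itself
print(simple_tree_matching(A, B))  # 5, all five nodes of B are matched
```

The sketch recomputes nothing twice for a given pair of sibling subtrees, which is what keeps the cost proportional to the product of the two tree sizes.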

There are two main limitations of this algorithm:

— It cannot match permutations of nodes;

— No level crossing is allowed (i.e., it is not possible to match at different hierarchical 
levels). 

Despite these intrinsic limits, this technique appears to fit very well the
purpose of matching HTML trees in the context of Web Data Extraction systems. In
fact, it has been adopted in several scenarios [Kim et al. 2007; Zhai and Liu 2005;
Zhang et al. 2010; Ferrara and Baumgartner 2011a; 2011b; 2011c]. There are several
reasons for this choice; for example, the simple tree matching algorithm
evaluates the similarity between two trees by producing the maximum matching through
dynamic programming, without computing insertion, relabeling and deletion
operations; moreover, approximate tree edit distance algorithms rely on complex
implementations to achieve good performance, whereas simple tree matching and
its variants are easy to implement.

A normalized variant of the simple tree matching is called normalized simple tree
matching, whose result is given by:

$$NSTM(A, B) = \frac{SimpleTreeMatching(A, B)}{(nodes(A) + nodes(B))/2}$$

The normalization is computed considering the average number of nodes in the given
trees (or sub-trees).
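
The normalization can be sketched in a few lines of Python. As before, the `(label, children)` tree representation, the helper names and the example trees are assumptions made for illustration.

```python
def stm(a, b):
    """Compact simple tree matching on (label, children) trees."""
    if a[0] != b[0]:
        return 0
    rows, cols = len(a[1]), len(b[1])
    M = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            M[i][j] = max(M[i][j - 1], M[i - 1][j],
                          M[i - 1][j - 1] + stm(a[1][i - 1], b[1][j - 1]))
    return M[rows][cols] + 1


def nodes(t):
    """Number of nodes in the tree (or sub-tree) t."""
    return 1 + sum(nodes(child) for child in t[1])


def nstm(a, b):
    """Simple tree matching divided by the average number of nodes."""
    return stm(a, b) / ((nodes(a) + nodes(b)) / 2)


# Hypothetical trees with 7 and 5 nodes respectively.
A = ("html", [("head", []),
              ("body", [("table", [("tr", [("td", []), ("td", [])])])])])
B = ("html", [("body", [("table", [("tr", [("td", [])])])])])

print(nstm(A, A))  # 1.0
print(nstm(A, B))  # 5 / 6, since 5 nodes match and the average size is 6
```

Dividing by the average size maps the raw matching count onto a scale where identical trees score 1, which makes scores comparable across tree pairs of different sizes.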

The weighted tree matching algorithm. Another variant of the simple tree
matching, discussed in the following, is called weighted tree matching. It adjusts
the similarity values provided by the original simple tree matching by introducing
a re-normalization factor.

The weighted tree matching, whose pseudo-code is reported in Algorithm 2, has
been recently presented in [Ferrara and Baumgartner 2011a].



Algorithm 2 WeightedTreeMatching(T', T'')
1: {Change line 15 of Algorithm 1 with the following code}
2: if m > 0 AND n > 0 then
3:    return M[m][n] * 1 / Max(t(T'), t(T''))
4: else
5:    return M[m][n] + 1 / Max(t(T'), t(T''))
6: end if



In Algorithm 2, the notation t(n) represents the total number of siblings of a
node n, including itself. Note that Algorithm 2 reports only the differences with
respect to the simple tree matching described in Algorithm 1. The advantage of the
weighted tree matching is that it better reflects a measure of similarity between two
trees. In fact, in the simple tree matching algorithm the assigned matching value
is always equal to one. Instead, the weighted tree matching algorithm assumes
that less importance (i.e., a lower weight) is assigned to changes in the structure
of the tree when they occur in deeper sub-levels. Changes of this kind can be, for
example, missing or added leaves, truncated or added branches, etc. Also, a lower
weight is assigned when changes occur in sub-levels with many nodes.

In the weighted tree matching, the weighted value assigned to a match between
two nodes is equal to one divided by the greater of the sibling counts of the
two compared nodes, each count including the node itself. This helps to reduce the
impact of missing/added nodes on the final similarity value.

Moreover, before assigning a weight, the algorithm checks whether it is comparing
two leaves, a leaf with a node which has children, or two nodes which both have
children. The final contribution of a sub-level of leaves is the sum of the weighted
values assigned to each leaf (cfr. Code Line (4,5)); thus, the contribution of the
parent node of those leaves is equal to its weighted value multiplied by the sum of
the contributions of its children (cfr. Code Line (2,3)). For each sub-level of leaves,
the maximum sum of assigned values is 1; thus, for each parent node of that
sub-level, the maximum value of the multiplication of its contribution with the sum of
the contributions of its children is 1; each sub-tree, individually considered,
contributes with a maximum value of 1. In the last recursion of this bottom-up
algorithm, the two roots will be evaluated. The resulting value at the end of the process
is the measure of similarity between the two trees, expressed in the interval [0,1]. The
closer the final value is to 1, the more similar the two trees.
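
The weighting scheme can be sketched in Python as follows. This is an illustrative reading of Algorithm 2, not the authors' implementation: the `(label, children)` tree representation, the extra parameters carrying the sibling counts, the assumption that each root has a sibling count of 1, and the example trees are all our own.

```python
def weighted_tree_matching(a, b, sib_a=1, sib_b=1):
    """Similarity in [0, 1] between two (label, children) trees.

    sib_a and sib_b are the sibling counts t(.) of the two compared nodes,
    each including the node itself (assumed to be 1 for the roots).
    """
    if a[0] != b[0]:
        return 0.0
    kids_a, kids_b = a[1], b[1]
    m, n = len(kids_a), len(kids_b)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Children of a have m siblings in total, children of b have n.
            w = weighted_tree_matching(kids_a[i - 1], kids_b[j - 1], m, n)
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    weight = 1.0 / max(sib_a, sib_b)
    if m > 0 and n > 0:
        return M[m][n] * weight   # inner node: scale the children's sum
    return M[m][n] + weight       # at least one leaf: contribute the weight


# Hypothetical trees: B is A without <head> and without one <td>.
A = ("html", [("head", []),
              ("body", [("table", [("tr", [("td", []), ("td", [])])])])])
B = ("html", [("body", [("table", [("tr", [("td", [])])])])])

print(weighted_tree_matching(A, A))  # 1.0
print(weighted_tree_matching(A, B))  # 0.25
```

In the second call the only matched leaf contributes 1/2 (it is compared against a sub-level of two cells), and the matched `body` node halves that again because `body` has one sibling in A, giving 0.25.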

Let us analyze the behavior of the algorithm with an example often used in
the literature [Yang 1991; Zhai and Liu 2005; Ferrara and Baumgartner 2011a] to
explain the simple tree matching (see Figure 2). In that figure, A and B are two
very simple generic rooted labeled trees (i.e., they have the same structure as HTML
document trees). They show several similarities except for some missing
nodes/branches.

By applying the weighted tree matching algorithm, in the first step the
contributions assigned to leaves reflect the considerations discussed above: for example,
a value of 1/3 is established for nodes (h), (i) and (j) belonging to A, although two
of them are missing in B. Going up to the parents, the sum of the contributions of
matching leaves is multiplied by the relative value of each node (e.g., in the first
sub-level, the contribution of each node is 1/4 because of the four first-sublevel nodes
in A).

Once these operations have been completed for all nodes of the sub-level, the values
are added and the final measure of similarity for the two trees is obtained. Intuitively,
in more complex and deeper trees, this process is iteratively executed for all the
sub-levels. The deeper a mismatch is found, the less its missing contribution will
affect the final measure of similarity. Analogous considerations hold for missing/added
nodes and branches, sub-levels with many nodes, etc. Table I shows the two matrices
adopted by the weighted tree matching algorithm, M and W, which contain the
matching and the weights at each iteration.
In this example, the weighted tree matching between A and B returns a measure
of similarity of 3/8 (0.375), whereas the simple tree matching would return a mapping
value of 7. The main difference between the results provided by these two algorithms
is the following: the weighted tree matching intrinsically produces a proper measure of
similarity between the two compared trees, while the simple tree matching returns
the mapping value.

Fig. 2. Example of application of the weighted tree matching algorithm for the comparison of two
labeled rooted trees, A and B.

Table I. W and M matrices for each matching subtree.

A remarkable feature of the weighted tree matching algorithm is that the more
complex and similar the structures of the considered trees are, the more accurate the
measure of similarity will be. On the other hand, for simple and quite different trees
the accuracy of this approach is lower than that ensured by the simple tree
matching.

2.2 Web wrappers 

In the previous section we discussed some algorithms that might be adopted to
identify information by exploiting the semi-structured format of HTML documents.
In the following we will discuss those procedures that might adopt the techniques
presented above to carry out the data extraction.

In the literature, any procedure that aims at extracting structured data from
unstructured (or semi-structured) data sources is usually referred to as a wrapper. In
the context of Web Data Extraction we provide the following definition:

Definition 2.1 Web wrapper. A procedure, which might implement one or more
classes of algorithms, that seeks and finds the data required by a human user,
extracts them from unstructured (or semi-structured) Web sources, and
transforms them into structured data, merging and unifying this information for
further processing, in a semi-automatic or fully automatic way.

Web wrappers are characterized by a life-cycle consisting of several steps:

(1) Wrapper generation: the wrapper is defined according to some technique(s); 

(2) Wrapper execution: the wrapper runs and extracts information continuously; 

(3) Wrapper maintenance: the structure of data sources may change and the wrap- 
per should be adapted accordingly to keep working properly. 

In the remainder of this section we will discuss these three different phases. 

In particular, the first two steps of a wrapper life-cycle, its generation and
execution, are discussed in Section 2.2.1; these steps might be implemented manually,
for example by defining and executing regular expressions over the HTML
documents; alternatively, and this is the aim of Web Data Extraction systems, wrappers
might be defined and executed by using an inductive approach, a process
commonly known as wrapper induction [Kushmerick 1997]. Web wrapper induction
is challenging because it requires high-level automation strategies. There also
exist hybrid approaches that make it possible for users to generate and run wrappers
semi-automatically by means of visual interfaces.

The last step of a wrapper life-cycle is the maintenance: Web pages change their 
structure continuously and without forewarning. This might corrupt the correct 
functioning of a Web wrapper, whose definition is usually tightly bound to the 
structure of the Web pages adopted for its generation. Defining automatic strate- 
gies for wrapper maintenance is of outstanding importance to guarantee correctness 
of extracted data and robustness of Web Data Extraction platforms. Some method- 
ologies have been recently presented in the literature, and are discussed in Section 
2.2.2. 

2.2.1 Wrapper generation and execution. The first step in the wrapper life-cycle
is generation. Early Web Data Extraction platforms provided support only
for the manual generation of wrappers, which required human expertise and skills in
programming languages to write scripts able to identify and extract selected pieces
of information within a Web page.

In the late nineties, more advanced Web Data Extraction systems made their
appearance. The core feature they provided was the possibility for users to define and
execute Web wrappers by means of interactive graphical user interfaces (GUIs). In
most cases, no deep understanding of a wrapper programming language was
required, as wrappers were generated automatically (or semi-automatically) by the
system, exploiting directives given by users through the platform interface.

In the following we discuss in detail three types of rationales underlying this
kind of platform, namely regular expressions, wrapper programming languages and
tree-based approaches.

The features of different Web Data Extraction platforms, instead, will be
described in great detail in Section 3.

Regular-expression-based approach. One of the most common approaches is based 
on regular expressions, which are a powerful formal language used to identify strings 
or patterns in unstructured text on the basis of some matching criteria. Rules can
be complex, so writing them manually can require much time and great expertise;
for this reason, wrappers based on regular expressions dynamically generate rules
to extract desired data from Web pages. Usually, writing regular expressions on
HTML pages relies on the following criteria: word boundaries, HTML tags, table
structure, etc. The advantage of platforms relying on regular expressions is that
the user can usually select (for example by means of a graphical interface) one
(or multiple) element(s) in a Web page, and the system is able to automatically
infer the appropriate regular expression to identify that element in the page.
Then, a wrapper might be created so as to extract similar elements from other Web
pages with the same structure as the one adopted to infer the regular expressions.
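The inference step described above can be illustrated with a minimal Python sketch. It is not the procedure of any specific platform: the function name infer_regex and the sample pages are hypothetical, and the sketch assumes that the selected text is directly enclosed by HTML tags, which are then used as delimiters of the inferred rule.

```python
import re

def infer_regex(page: str, selected: str) -> str:
    """Infer a rule capturing whatever lies between the HTML tags that
    surround the first occurrence of `selected` in `page` (assumed to exist)."""
    start = page.index(selected)
    end = start + len(selected)
    # Nearest tag before the selection and nearest tag after it.
    left = re.findall(r"<[^>]+>", page[:start])[-1]
    right = re.search(r"<[^>]+>", page[end:]).group(0)
    return re.escape(left) + r"\s*(.*?)\s*" + re.escape(right)

# The user "selects" the price on one page ...
page1 = '<ul><li class="price">$10.50</li><li class="name">Lamp</li></ul>'
rule = infer_regex(page1, "$10.50")

# ... and the inferred rule is reused on a page with the same structure.
page2 = '<ul><li class="price">$7.00</li><li class="name">Desk</li></ul>'
prices = re.findall(rule, page2)
```

The rule is tied to the exact delimiting tags, so it only transfers to pages generated with the same markup.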

A notable tool implementing regular-expression-based extraction is W4F [Sahuguet 
and Azavant 1999]. W4F adopts an annotation approach: instead of challenging
users to deal with the syntax of HTML documents, W4F eases the design of the
wrapper by means of a wizard procedure. This wizard allows users to select and annotate
elements directly on the Web page. W4F produces the regular expression extrac- 
tion rules of the annotated items and provides them to users. A further step, which 
is the optimization of the regular expressions generated by W4F, is delegated to 
expert users - in fact, the tool is not always able to provide the best extraction 
rule. By fully exploiting the power of regular expressions, W4F extraction rules
include match and also split expressions, which separate words, annotating different
elements on the same string. The drawback of the adoption of regular expressions
is their lack of flexibility. For example, whenever even a minor change occurs in the
structure or content of a Web page, each regular expression is very likely to stop
working, and must be rewritten. This process implies a big commitment by human
users, in particular in the maintenance of systems based on regular expressions. For
this reason, more flexible and powerful languages have been developed to extend
the capabilities of Web Data Extraction platforms.
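As a hedged illustration of match and split expressions, and of the brittleness just discussed, the following Python sketch uses the standard re module as a stand-in; it does not reproduce W4F's own rule syntax, and the rule, the pages and the function extract_names are hypothetical.

```python
import re

# Hypothetical extraction rule bound to a specific tag and attribute.
RULE = re.compile(r'<td class="author">(.*?)</td>')

def extract_names(page: str) -> list:
    found = RULE.search(page)                 # "match" step: locate the item
    if found is None:
        return []                             # the rule no longer applies
    return re.split(r"\s+", found.group(1))   # "split" step: separate words

old_page = '<tr><td class="author">Ada Lovelace</td></tr>'
new_page = '<tr><td class="writer">Ada Lovelace</td></tr>'  # attribute renamed

names_before = extract_names(old_page)
names_after = extract_names(new_page)
```

Renaming a single attribute is enough to make the rule return nothing, although the data on the page is unchanged.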

Logic-based approach. One example of powerful languages developed for data 
extraction purposes comes from the Web specific wrapper programming languages. 
Tools based on wrapper programming languages consider Web pages not simply as
text strings but as semi-structured tree documents, in which the DOM of the Web
page represents its structure and nodes are elements characterized by both their
properties and their content. The advantage of such an approach is that wrapper
programming languages might be defined to fully exploit both the semi-structured
nature of the Web pages and their contents - the former aspect is lacking in
regular-expression-based systems.

The first powerful wrapping language was formalized by Gottlob and Koch
[Gottlob and Koch 2004a]. The information extraction functions implemented by
this wrapping language rely on monadic datalog over trees [Gottlob and Koch
2004b]. The authors demonstrated that monadic datalog over trees is equivalent
to monadic second order logic (MSO), and hence very expressive. However, unlike
MSO, a wrapper in monadic datalog can be modeled nicely in a visual and
interactive step-by-step manner. This makes this wrapping language suitable for being
incorporated into visual tools, satisfying the condition that all its constructs can
be implemented through corresponding visual primitives.

A flavor of the functioning of the wrapping language is provided in the
following. Starting from the unranked labeled tree representing the DOM of the
Web page, the algorithm re-labels nodes, truncates the irrelevant ones, and finally
returns a subset of the original tree nodes, representing the selected data extracted.
The first implementation of this wrapping language in a real-world scenario is due
to Baumgartner et al. [Baumgartner et al. 2001b; 2001a]. They developed the
Elog wrapping language that implements almost all monadic datalog information
extraction functions, with some minor restrictions. The Elog language is used as the
core extraction method of the Lixto Visual Wrapper system. This platform provides 
a GUI to select, through visual specification, patterns in Web pages, in hierarchical 
order, highlighting elements of the document and specifying relationships among 
them. Information identified in this way could be too general, thus the system allows 
users to add some restricting conditions, for example before/after, not-before/not- 
after, internal and range conditions. Finally, selected data are translated into XML 
documents, by using pattern names as XML element names, obtaining structured 
data from semi-structured Web pages. 
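To give an intuition of unary patterns selecting tree nodes, and of the final translation into XML with pattern names used as element names, a minimal Python sketch follows. It is only an illustration under stated assumptions: the rules written in the comments use an informal datalog-like notation rather than actual Elog syntax, and the sample page and the pattern names (record, name, price) are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page standing in for the DOM tree.
page = ET.fromstring(
    "<table>"
    "<tr><td>Lamp</td><td>10.50</td></tr>"
    "<tr><td>Desk</td><td>99.00</td></tr>"
    "</table>"
)

def record(root):
    # record(x) :- label_tr(x).
    return list(root.iter("tr"))

def name(rec):
    # name(x) :- record(p), firstchild(p, x), label_td(x).
    first = rec[0]
    return first if first.tag == "td" else None

def price(rec):
    # price(x) :- record(p), lastchild(p, x), label_td(x).
    last = rec[-1]
    return last if last.tag == "td" else None

def to_xml(root) -> str:
    """Serialize selected nodes, using pattern names as XML element names."""
    out = ET.Element("records")
    for rec in record(root):
        item = ET.SubElement(out, "record")
        for pattern in (name, price):
            node = pattern(rec)
            if node is not None:
                ET.SubElement(item, pattern.__name__).text = node.text
    return ET.tostring(out, encoding="unicode")

result = to_xml(page)
```

Each pattern is a unary predicate over nodes, which is what makes it possible to define patterns one at a time, hierarchically, through a visual interface.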

Tree-based approach [partial tree alignment]. The last technique discussed in this 
part relates to wrapper generation and is called partial tree alignment. It has been 
recently formalized by Zhai and Liu [Zhai and Liu 2005; 2006] and the authors also 
developed a Web Data Extraction system based on it. This technique relies on the
idea that information in Web documents is usually collected in contiguous regions
of the page, called data record regions. The strategy of partial tree alignment 
consists in identifying and extracting these regions. In particular, the authors take 
inspiration from tree matching algorithms, by using the already discussed tree edit 
distance matching (see Section 2.1.2). The algorithm works in two steps: 

(1) Segmentation; 

(2) Partial tree alignment. 

In the first phase, the Web page is split into segments, without extracting any data.
This pre-processing phase is instrumental to the latter step. In fact, the system
not only performs an analysis of the Web page document based on the DOM tree,
but also relies on visual cues (as in the spatial reasoning technique, see Section
2.3.1), trying to identify gaps between data records. This step is also useful because
it helps the process of extracting structural information from the HTML document
in those situations in which the HTML syntax is abused, for example by using tabular
structures instead of CSS to arrange the graphical aspect of the page.

In the second step, the partial tree alignment algorithm is applied to the data
records identified earlier. Each data record is extracted from its DOM sub-tree
position, constituting the root of a new single tree. This is because each data
record could be contained in more than one non-contiguous sub-tree in the original
DOM tree. The partial tree alignment approach aligns only those data fields that
can be aligned with certainty, excluding those that cannot, to ensure a high degree
of precision. During this process no data items are involved, because partial tree
alignment works only on matching tree tags, measured as the minimum cost, in terms
of operations (i.e., node removal, node insertion, node replacement), to transform
one tree into another one. The drawback of this characteristic of the algorithm is
that its recall performance (i.e., the ability to recover all expected information)
might decay in case of complex HTML document structures. In addition, also in the
case of partial tree alignment, the functioning of this strategy is strictly related
to the structure of the Web page at the time of the definition of the alignment.
This implies that the method is very sensitive even to small changes, which might
compromise the functioning of the algorithm and the correct extraction of
information. Thus, even in this approach, the problem of maintenance arises with
outstanding importance.
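The tag-only matching underlying the alignment can be sketched in Python as follows. The sketch is not the authors' implementation: for brevity it swaps in the simple tree matching recurrence, a restricted and cheaper relative of tree edit distance that counts the maximum number of matching node pairs between two ordered labeled trees. The tree encoding (label, list of children) and the sample records are assumptions made for illustration.

```python
def simple_tree_matching(a, b) -> int:
    """Size of the maximum matching between two ordered labeled trees.
    A tree is a pair (label, [children]); only tags are compared, never data."""
    if a[0] != b[0]:
        return 0                      # different root tags: nothing matches
    kids_a, kids_b = a[1], b[1]
    # m[i][j] = best matching between the first i and first j child sub-trees
    m = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i in range(1, len(kids_a) + 1):
        for j in range(1, len(kids_b) + 1):
            m[i][j] = max(
                m[i - 1][j],
                m[i][j - 1],
                m[i - 1][j - 1]
                + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return m[len(kids_a)][len(kids_b)] + 1   # +1 for the matched roots

# Two data records: the second one lacks a field.
rec1 = ("tr", [("td", []), ("td", []), ("td", [])])
rec2 = ("tr", [("td", []), ("td", [])])
score = simple_tree_matching(rec1, rec2)
```

Nodes left unmatched (here, one td of the first record) are the ones a partial alignment would leave aside rather than align with uncertainty.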

2.2.2 The problem of wrapper maintenance. Wrapper generation, regardless of the
adopted technique, is one aspect of the problem of data extraction from Web
sources. Wrapper maintenance is equally important, so that
Web Data Extraction platforms may reach high levels of robustness and reliability,
hand in hand with a high level of automation and a low level of human engagement. In
fact, differently from static documents, Web pages dynamically change and evolve,
and their structure may change, sometimes with the consequence that previously
defined wrappers are no longer able to successfully extract data.

In the light of these assumptions, one could argue that wrapper maintenance
is a critical step of the Web Data Extraction process. Even so, until recent
years this aspect has not received much attention in the literature (much less
than the problem of wrapper generation). In the early stage, in fact, wrapper
maintenance was performed manually: the users who designed Web wrappers
updated or rewrote them every time the structure of a given Web
page was modified. The manual maintenance approach fitted small
problems well, but becomes unsuitable if the pool of Web pages grows large. Since
in enterprise scenarios regular data extraction tasks might involve thousands (or
even more) of Web pages, dynamically generated and frequently updated, manual
wrapper maintenance is no longer a feasible solution for real-world applications.

For these reasons, the problem of automating wrapper maintenance has
been addressed in recent literature. The first effort in the direction of
automatic wrapper maintenance was presented by Kushmerick [Kushmerick
2002], who first defined the concept of wrapper verification. The task of wrapper
verification arises as a required step during wrapper execution, in which a Web Data
Extraction system assesses whether the defined Web wrappers work properly or,
alternatively, whether their functioning is corrupted due to modifications to the
structure of the underlying pages. Subsequently, the author discussed some
techniques of semi-automatic wrapper maintenance, to handle simple problems.
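A minimal Python sketch of the idea of wrapper verification is given below. It is an assumption-laden illustration, not the algorithm of the cited work: the functions shape and verify, the coarse "shape" feature and the 0.8 threshold are all hypothetical choices. The check compares the syntactic shape of freshly extracted values with that of values known to be correct.

```python
import re

def shape(value: str) -> str:
    """Collapse a string to a coarse pattern: digit runs -> 9, letter runs -> a."""
    collapsed = re.sub(r"[0-9]+", "9", value)
    return re.sub(r"[A-Za-z]+", "a", collapsed)

def verify(known_good, extracted, min_ratio=0.8) -> bool:
    """True if enough extracted values share a shape seen in known-good data."""
    if not extracted:
        return False                  # extracting nothing is itself a failure
    shapes = {shape(v) for v in known_good}
    matching = sum(shape(v) in shapes for v in extracted)
    return matching / len(extracted) >= min_ratio

known = ["$10.50", "$7.00"]                       # values from a trusted run
still_ok = verify(known, ["$3.99", "$120.00"])    # same shape: wrapper works
broken = verify(known, ["Add to cart", "Login"])  # wrapper now hits other nodes
```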

The first method that tries to automate the process of wrapper maintenance
was developed by Meng et al. [Meng et al. 2003], and it is called
schema-guided wrapper maintenance. It relies on the definition of XML schemas during
the phase of wrapper generation, to be exploited for the maintenance during the
execution step. More recently, Ferrara and Baumgartner [Ferrara and Baumgartner
2011a; 2011b; 2011c] developed a system of automatic wrapper adaptation (a kind
of maintenance that modifies Web wrappers according to the new structure
of the Web pages) relying on the analysis of structural similarities between different
versions of the same Web page using a tree-edit algorithm. In the following, we
discuss these two strategies of wrapper maintenance.

Schema-guided wrapper maintenance. The first attempt to deal with the problem
of wrapper maintenance providing a high level of automation was presented
by Meng et al. [Meng et al. 2003]. The authors developed SG-WRAM
(Schema-Guided WRApper Maintenance), a strategy for Web Data Extraction that is built
on top of the assumption, based on empirical observations, that changes in Web
pages, even substantial ones, often preserve:

— Syntactic features: syntactic characteristics of data items like data patterns,
  string lengths, etc., are mostly preserved.
— Hyperlinks: HTML documents are often enriched by hyperlinks that are seldom
  removed in subsequent modifications of the Web pages.
— Annotations: descriptive information representing the semantic meaning of a
  piece of information in its context is usually maintained.

In the light of these assumptions, the authors developed a Web Data Extraction
system that, during the phase of wrapper generation, creates schemas which will be
exploited during the phase of wrapper maintenance. In detail, during the generation
of wrappers, the user provides HTML documents and XML schemas, specifying a
mapping between them. Later, the system generates extraction rules and then
executes the wrappers to extract data, building an XML document according
to the specified XML schema. During the wrapper execution phase, an additional
component is introduced in the pipeline of the data extraction process: the wrapper
maintainer. The wrapper maintainer checks for any potential extraction issue and
provides an automatic repairing protocol for wrappers which fail their extraction
task because of modifications in the structure of related Web pages. The repairing
protocol might be successful, and in that case the data extraction continues, or
it might fail - in that case warnings and notifications are raised. The XML schemas
are defined in the format of a DTD (Document Type Definition) and the HTML
documents are represented as DOM trees, as explained in Section
2.1. The SG-WRAM system builds corresponding mappings between them and
generates extraction rules in the format of XQuery expressions (see footnote 4).
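The execute, check and repair cycle described above can be sketched in Python as follows. This is a simplified illustration and not SG-WRAM itself: regular expressions are swapped in for XQuery rules, and the functions extract, repair and run_with_maintenance, as well as the sample pages, are hypothetical. The repair step relies on two of the preserved features listed earlier, the annotation near the data item and the syntactic pattern of the item.

```python
import re

def extract(page: str, rule: str):
    found = re.search(rule, page)
    return found.group(1) if found else None

def repair(page: str, annotation: str, data_pattern: str):
    """Build a new rule from preserved features: the annotation text and the
    syntactic pattern of the data item. Returns None if the repair fails."""
    candidate = re.escape(annotation) + r"\D*?(" + data_pattern + ")"
    return candidate if re.search(candidate, page) else None

def run_with_maintenance(page, rule, annotation, data_pattern):
    value = extract(page, rule)
    if value is not None:
        return value, rule                     # wrapper still works
    new_rule = repair(page, annotation, data_pattern)
    if new_rule is None:
        raise RuntimeError("wrapper repair failed: manual intervention needed")
    return extract(page, new_rule), new_rule   # extraction continues

old_rule = r"<b>Price:</b>\s*<i>(\d+\.\d{2})</i>"
old_page = "<b>Price:</b> <i>12.30</i>"
new_page = "<div><span>Price:</span> <em>12.30</em></div>"  # markup changed

value, adapted_rule = run_with_maintenance(
    new_page, old_rule, "Price:", r"\d+\.\d{2}"
)
```

When neither the original rule nor the repaired one applies, the sketch raises an error, mirroring the warnings raised when the repairing protocol fails.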

Automatic wrapper adaptation. Another strategy for the automatic maintenance 
of Web wrappers has been recently presented by Ferrara and Baumgartner [Ferrara 
and Baumgartner 2011a; 2011b; 2011c]. In detail, it is a method of automatic wrap- 
per adaptation that relies on the idea of comparing helpful structural information 
stored in the Web wrapper defined on the original version of the Web page, search- 
ing for similarities in the new version of the page, after any structural modification 
occurs. 

The strategy works for different techniques of data extraction implemented by 
the wrapping system. For example, it has been tested by using both XPath (see 
Section 2.1.1) and the Elog wrapping language (see Section 2.2.1). In this strategy, 
elements are identified and represented as sub-trees of the DOM tree of the Web 
page, and can be exploited to find similarities between two different versions of the 
same document. 

We discuss an example adopting XPath to address a single element in a Web 
page, as reported in the example in Figure 1(A). The rationale behind the automatic 
wrapper adaptation is to search for some elements, in the modified version of the 



4 For XQuery specifications see: http://www.w3.org/TR/xquery/ 
Web page, that share structural similarities with the original one. The evaluation of
the similarity is done on the basis of comparable features (e.g., subtrees, attributes,
etc.). These elements are called candidates: among them, the one showing the
highest degree of similarity with the element in the original page is matched in the
new version of the page. The algorithm adopted to compute the matching among
the DOM trees of the two HTML documents is the weighted tree matching, already
presented in Section 2.1.2. Further heuristics are adopted to assess the similarity
of nodes, for example exploiting additional features exhibited by DOM elements.
In some cases, for example, their textual contents might be compared, according to
string distance measures such as Jaro-Winkler [Winkler 1999] or bigrams [Collins
1996], to also take into account the content similarity of two given nodes.

It is possible to extend the same approach in the case in which the XPath identifies 
multiple similar elements on the original page (e.g., an XPath selecting results of 
a search in a retail online shop, represented as table rows, divs or list items), 
as reported in Figure 1(B). In detail, it is possible to identify multiple elements 
sharing a similar structure in the new page, within a custom level of accuracy (e.g., 
establishing a threshold value of similarity). 
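The content-based part of the candidate selection can be sketched in Python as follows. The sketch makes several assumptions: it covers only textual similarity (not the weighted tree matching), it uses the Dice coefficient over character bigram sets as one common bigram-based measure, and the function names and the 0.5 threshold are hypothetical.

```python
def bigrams(text: str) -> set:
    text = text.lower()
    return {text[i:i + 2] for i in range(len(text) - 1)}

def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigram sets, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

def best_candidate(original: str, candidates: list) -> str:
    """Single-element case: keep the most similar candidate."""
    return max(candidates, key=lambda c: dice(original, c))

def similar_candidates(original: str, candidates: list, threshold=0.5) -> list:
    """Multiple-element case: keep every candidate above the threshold."""
    return [c for c in candidates if dice(original, c) >= threshold]

labels = ["Total cost", "Shipping", "Total price (EUR)"]
best = best_candidate("Total price", labels)
several = similar_candidates("Total price", labels)
```

Raising the threshold trades recall for precision, which corresponds to the custom level of accuracy mentioned above.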

The authors implemented this approach in a commercial tool, Lixto,
describing how the pipeline of the wrapper generation has been modified to allow
wrappers to automatically detect and adapt their functioning to structural
modifications occurring to Web pages [Ferrara and Baumgartner 2011c]. In this context,
the strategy proposed was to acquire structural information about those elements
the original wrapper extracts, storing it directly inside the wrapper itself. This
is done, for example, by generating signatures representing the DOM sub-tree of
elements extracted from the original Web page, stored as a tree diagram or as a
simple XML document.

During the execution of Web wrappers, if any extraction issue occurs due to a 
structural modification of the page, the wrapper adaptation algorithm automati- 
cally starts and tries to adapt the wrapper to the new structure. 

2.3 Machine Learning approaches 

Not only has the literature in the context of Artificial Intelligence and Algorithms
focused on the problem of Web Data Extraction, but the field of Machine Learning
has also proposed several interesting solutions. In particular, Machine Learning
techniques fit well the purpose of extracting domain-specific information from Web
sources, since they rely on training sessions during which a system acquires
domain expertise. On the other hand, training learning-based Web Data Extraction
systems usually requires a large amount of manually labeled Web pages, which implies
a bootstrap period in which the systems require a high level of human engagement.
Moreover, during the manual labeling phase, domain experts should provide both
positive and negative examples, acquired from different websites but also from the
same website. Particular attention should be paid to providing examples of Web
pages belonging to the same domain but exhibiting different structures. This is
because, even in the same domain scenario, the templates usually adopted to generate
dynamic Web pages differ, and the system should be capable of
learning how to extract information in these contexts. Statistical Machine Learning
systems were also developed, relying on conditional models [Phan et al. 2005] or
adaptive search [Turmo et al. 2006] as an alternative solution to human knowledge
and interaction.

WIEN. In the early literature, some interesting wrapper induction systems have 
been presented. In fact, Kushmerick developed the first wrapper induction system, 
called WIEN [Kushmerick 2000], based on different inductive learning techniques. 
Interestingly, WIEN was capable of automatically labeling training pages,
representing de facto a hybrid system whose training process was accelerated and
implied low human engagement. The flip side of the high automation of WIEN was the
large number of limitations related to its inferential system: for example, the data
extraction process was not capable of dealing with missing values - a case that
occurs frequently and posed serious limitations on the adaptability of WIEN
to real-world scenarios.

SoftMealy. Another learning-based Web Data Extraction platform, called
SoftMealy, was presented by Hsu and Dung [Hsu and Dung 1998]. SoftMealy was
the first wrapper induction system specifically designed to work in the Web Data
Extraction context. Relying on non-deterministic finite state automata (also known
as finite-state transducers (FST)), SoftMealy uses a bottom-up inductive learning
approach to learn extraction rules. During the training session the system acquires
training pages represented as an automaton on all the possible permutations of Web
pages: states represent extracted data, while state transitions represent extraction
rules. SoftMealy's main strength was its novel method of internal representation
of the HTML documents. In detail, during a pre-processing step, each considered
Web page was encoded into tokens (defined according to a set of inferential rules).
Then, tokens were exploited to define separators, considered as invisible borderlines
between two consecutive tokens. Finally, the FST was fed with a sequence of
separators, instead of raw HTML strings (as in WIEN), so as to match tokens with
contextual rules (defined to characterize a set of individual separators) to
determine the state transitions.

The advantages of SoftMealy with respect to WIEN are worth noting: in fact,
the system was able to deal with a number of exceptions, such as missing
values/attributes, multiple attribute values, variant attribute permutations and
also typos.

STALKER. The last learning-based system discussed in this part is called STALKER
[Muslea et al. 1999]. It was a supervised learning system for wrapper induction
sharing some similarities with SoftMealy. The main difference between these two
systems is the specification of relevant data: in STALKER, a set of tokens is
manually positioned on the Web page, so as to identify the information that the
user intends to extract. This aspect ensures the capability of STALKER to handle
empty values, hierarchical structures and non-ordered items. This system models the
content of a Web page by means of hierarchical relations, represented by using a
tree data structure called embedded catalog tree (EC tree). The root of the EC tree
is populated with the sequence of all tokens (STALKER considers as a token any piece
of text or HTML tag in the document). Each child node is a sub-sequence of tokens
inherited from its parent node. This implies that each parent node is a
super-sequence of the tokens of its children. The super-sequence is used, at each
level of the hierarchy, to keep track of the content in the sub-levels of the EC
tree. The extraction of elements of interest for the user is achieved by inferring
a set of extraction rules on the EC tree itself - a typical example of an extraction
rule inferred by STALKER is the construct SkipTo(T), a directive that indicates,
during the extraction phase, to skip all tokens until the first occurrence of the
token T is found. The inference of extraction rules exploits the concept of
landmarks, sequences of consecutive tokens adopted to locate the beginning and the
end of a given item to extract. STALKER is also able to define wildcards, classes of
generic tokens that are inclusive of more specific tokens.
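The way a chain of SkipTo directives locates an item can be sketched in Python as follows. The sketch only applies rules that are already given; it does not perform the inductive learning step, and the tokenizer, the function names and the sample page are assumptions made for illustration.

```python
import re

def tokenize(html: str) -> list:
    """Split a page into HTML tags, words and punctuation marks."""
    return re.findall(r"<[^>]+>|\w+|[^\w\s<]", html)

def skip_to(tokens, landmark, start=0) -> int:
    """SkipTo(landmark): index just past the first occurrence of the landmark
    (a list of consecutive tokens) at or after `start`, or -1 if absent."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return -1

def extract(tokens, start_rule, end_landmark):
    """Apply a chain of SkipTo directives, then read up to the end landmark."""
    pos = 0
    for landmark in start_rule:
        pos = skip_to(tokens, landmark, pos)
        if pos < 0:
            return None               # start landmark missing
    end = skip_to(tokens, end_landmark, pos)
    if end < 0:
        return None                   # end landmark missing
    return tokens[pos:end - len(end_landmark)]

tokens = tokenize("<p>Name: <b>Yala</b></p><p>Phone: <i>555</i></p>")
# Start rule: SkipTo(Phone) SkipTo(<i>); end landmark: </i>
phone = extract(tokens, [["Phone"], ["<i>"]], ["</i>"])
```

Because each rule is anchored to landmarks rather than to absolute positions, an item that is absent simply yields no match instead of shifting the items that follow it.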

2.3.1 Hybrid systems: learning-based wrapper generation. Wrapper generation
systems discussed in Section 2.2 and wrapper induction techniques discussed in
Section 2.3 differ essentially in two aspects:

(1) The degree of automation of the Web Data Extraction systems; 

(2) The amount and the type of human engagement required for the functioning. 

The first point is related to the ability of the system to work in an autonomous 
way, ensuring sufficient standards of robustness and reliability, according to the 
requirements of users. Regarding the second point, most of the wrapper induction 
systems need labeled examples provided during the training sessions, thus requiring 
human expert engagement for the manual labeling phase. Wrapper generation 
systems, on the other hand, engage users into their maintenance, unless automatic 
techniques are employed, as those discussed in Section 2.2.2. 

Interestingly, a new class of platforms has been discussed in recent literature, 
that adopts a hybrid approach that sits between learning-based wrapper induction 
systems and wrapper generation platforms. The first example of this class of sys- 
tems is given by RoadRunner [Crescenzi et al. 2001; Crescenzi and Mecca 2004], a 
template-based system that automatically generates templates to extract data by 
matching features from different pages in the same domain. Another interesting 
approach is that of exploiting visual cues and spatial reasoning to identify elements 
in the Web pages with a Computer Vision oriented paradigm. This part concludes 
with the discussion of these two systems. 

Template-based matching. The first example of hybrid system is provided by 
RoadRunner [Crescenzi et al. 2001; Crescenzi and Mecca 2004]. This system might 
be considered as an interesting example of automatic wrapper generator. The main 
strength of RoadRunner is that it is oriented to data-intensive websites based on 
templates or regular structures. The system tackles the problem of data extraction 
exploiting both features used by wrapper generators, and by wrapper induction 
systems. In particular, RoadRunner can work using information provided by users,
in the form of labeled example pages, or also by automatically labeling Web pages
(as WIEN does), to build a training set. In addition, it might exploit a priori
knowledge on the schema of the Web pages, for example taking into account
previously learned page templates.

RoadRunner relies on the idea of working with two HTML pages at the same
time in order to discover patterns by analyzing similarities and differences between
the structure and content of each pair of pages. Essentially, RoadRunner can extract
relevant information from any Web site containing at least two Web pages with a
similar structure. Since Web pages are usually dynamically generated starting from a
template, and relevant data are positioned in the same (or in similar) areas of the
page, RoadRunner is able to exploit this characteristic to identify relevant pieces
of information while, at the same time, taking into account small differences due to
missing values or other mismatches.

The authors define a class of pages as a set of Web sources characterized by a
common generation script. Then, the problem is reduced to extracting relevant data
by generating wrappers for classes of pages, starting from the inference of a common
structure from the two-page-based comparison. This system can handle missing
and optional values and also structural differences, adapting very well to all kinds
of data-intensive Web sources. Another strength of RoadRunner is its high-quality
open-source implementation (see footnote 5), which provides a high degree of
reliability of the extraction system.
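The two-page comparison can be illustrated with a deliberately reduced Python sketch. It is not the RoadRunner algorithm: it handles only the simplest case, in which the two pages have the same tag structure and differ in their text, so that each text mismatch becomes a data field. Optional and repeated structures are not handled, and the function names, the field marker and the sample pages are hypothetical.

```python
import re

FIELD = "#PCDATA"   # placeholder marking a data field in the template

def tokenize(html: str) -> list:
    parts = re.findall(r"<[^>]+>|[^<]+", html)
    return [p.strip() for p in parts if p.strip()]

def infer_template(page_a: str, page_b: str) -> list:
    """Align two pages token by token; text mismatches become fields."""
    tokens_a, tokens_b = tokenize(page_a), tokenize(page_b)
    if len(tokens_a) != len(tokens_b):
        raise ValueError("this sketch handles text mismatches only")
    template = []
    for x, y in zip(tokens_a, tokens_b):
        if x == y:
            template.append(x)                    # shared template token
        elif not x.startswith("<") and not y.startswith("<"):
            template.append(FIELD)                # differing text: a field
        else:
            raise ValueError("tag mismatch: optional or repeated structure")
    return template

def apply_template(template: list, page: str) -> list:
    return [tok for slot, tok in zip(template, tokenize(page)) if slot == FIELD]

page_a = "<html><b>Title:</b> Dune <i>1965</i></html>"
page_b = "<html><b>Title:</b> Emma <i>1815</i></html>"
template = infer_template(page_a, page_b)
data = apply_template(template, "<html><b>Title:</b> Ulysses <i>1922</i></html>")
```

Text that is identical in both pages (here, "Title:") is treated as part of the template, which is why at least two pages of the same class are needed.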

Spatial reasoning. The Computer Vision paradigm has also inspired the
field of Web Data Extraction systems. In fact, a recent model of data extraction,
called Visual Box Model, has been presented [Krüpl et al. 2005; Gatterbauer and
Bohunsky 2006]. The Visual Box Model exploits visual cues to understand whether,
in the version of the Web page displayed on the screen after rendering by the Web
browser, data are present, for example, in a tabular format. The advantage of this
strategy is that it is possible to acquire data not necessarily represented by means
of the standard HTML <table> format.

The functioning of this technique is based on an X-Y cut OCR algorithm. Given
the rendered version of a Web page, this algorithm is able to generate a
visual grid, where elements of the page are allocated according to their coordinates
- determined by visual cues. Cuts are recursively applied to the bitmap image
representing the rendering of the Web page, and stored into an X-Y tree. This tree
is built so that ancestor nodes with leaves represent non-empty tables. Some
additional operations check whether extracted tables contain useful information. This
is done because - although it is a deprecated practice - many Web pages use tables
for structural and graphical purposes, instead of for data representation.
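A recursive X-Y cut can be sketched in Python as follows. As a simplifying assumption, the sketch operates on the bounding boxes of rendered elements, given as (x0, y0, x1, y1) tuples, rather than on a bitmap image; the function names and the sample layout are hypothetical.

```python
def split_by_gaps(boxes, axis):
    """Group boxes separated by an empty gap along one axis (0 = x, 1 = y)."""
    lo, hi = axis, axis + 2           # start and end coordinate of each box
    groups, current, reach = [], [], None
    for box in sorted(boxes, key=lambda b: b[lo]):
        if current and box[lo] >= reach:
            groups.append(current)    # gap found: close the current group
            current = []
        current.append(box)
        reach = box[hi] if len(current) == 1 else max(reach, box[hi])
    if current:
        groups.append(current)
    return groups

def xy_cut(boxes, axis=1):
    """Build an X-Y tree as nested lists, alternating horizontal cuts
    (axis=1, splitting rows) and vertical cuts (axis=0, splitting columns)."""
    if len(boxes) == 1:
        return boxes[0]
    groups = split_by_gaps(boxes, axis)
    if len(groups) == 1:              # no cut along this axis: try the other
        axis = 1 - axis
        groups = split_by_gaps(boxes, axis)
        if len(groups) == 1:
            return list(boxes)        # overlapping boxes: cannot be cut
    return [xy_cut(group, 1 - axis) for group in groups]

# Four rendered elements laid out as a 2 x 2 grid, with no <table> involved.
cells = [(0, 0, 10, 5), (12, 0, 22, 5), (0, 7, 10, 12), (12, 7, 22, 12)]
tree = xy_cut(cells)
```

The resulting tree has one inner node per row and one leaf per cell, which is the kind of regular shape that suggests tabular data regardless of the underlying markup.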

The Visual Box Model data extraction system is implemented by means of an
internal rendering engine that produces a visualization of the Web page relying on
Gecko (see footnote 6), the same rendering engine used by the Mozilla Web browser.
By exploiting the CSS 2.0 box model, the algorithm is able to access the positional
information of any given element. This is achieved by a bridge between the rendering
engine and the application, implemented by means of the XPCOM library (see
footnote 7).

3. WEB DATA EXTRACTION SYSTEMS 

In this section we go into detail regarding the characteristics of existing Web Data
Extraction systems.

5 http://www.dia.uniroma3.it/db/roadRunner/
6 https://developer.mozilla.org/en/Gecko
7 https://developer.mozilla.org/it/XPCOM

We can generically define a Web Data Extraction system as a platform
implementing a sequence of procedures (for example, Web wrappers) that extract
information from Web sources [Laender et al. 2002]. From this generic definition, we
can infer two fundamental aspects of the problem:

— Interaction with Web pages. The first phase of a generic Web Data Extraction
system is the Web interaction [Wang et al. 2000]: the Web Data Extraction system
accesses a Web source and extracts information stored in it. Web sources usually
coincide with Web pages, but some approaches also consider RSS/Atom feeds
[Hammersley 2005] and Microformats [Khare and Celik 2006].
Some commercial systems, Lixto first of all but also Kapow Mashup Server
(described below), include a Graphical User Interface for fully visual and interactive
navigation of HTML pages, integrated with data extraction tools.
The most advanced Web Data Extraction systems support the extraction of data
from pages reached by deep Web navigation [Baumgartner et al. 2005]: they
simulate the activity of users clicking on DOM elements of pages, through macros
or, more simply, by filling HTML forms.

These systems also support the extraction of information from dynamically gen- 
erated Web pages, usually built at run-time as a consequence of the user request, 
filling a template page with data from some database. The other kind of pages 
are commonly called static Web pages, because of their static content. 
OxPath [Furche et al. 2011], which is part of the DIADEM project [Furche et al. 
2012] is a declarative formalism that extends XPath to support deep Web naviga- 
tion and data extraction from interactive Web sites. It adds new location steps to 
simulate user actions, a new axis to select dynamically computed CSS attributes, 
a convenient field function to identify visible fields only, page iteration with a 
Kleene-star extension for repeated navigations, and new predicates for marking 
expressions for extraction identification. 
— Generation of a wrapper. A Web Data Extraction system must implement the 
support for wrapper generation and wrapper execution. 

Another definition of Web Data Extraction system was provided by Baumgartner 
et al. [Baumgartner et al. 2009]. They defined a Web Data Extraction system as 
"a software extracting, automatically and repeatedly, data from Web pages with 
changing contents, and that delivers extracted data to a database or some other 
application" . 

This is the definition that best fits the modern view of the problem of Web
Data Extraction, as it introduces three important aspects:

— Automation and scheduling 
— Data transformation, and the 
— Use of the extracted data 

In the following we shall discuss each of these aspects in detail.

Automation and Extraction. Automating the access to Web pages, as well as the
localization of their elements, is one of the most important features of recent
Web Data Extraction systems [Phan et al. 2005]. Important automation features
include the capability to create macros that execute multiple instances of the
same task, the possibility to simulate the click stream of the user (filling
forms and selecting menus and buttons), and the support for AJAX technology
[Garrett 2005] to handle the asynchronous updating of the page.

Scheduling is also important: for example, if a user wants to extract data from
a news Web site updated every 5 minutes, many recent tools allow the user to set
up a scheduler that works like cron, launching macros and executing scripts
automatically and periodically.
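
The cron-like behaviour described above can be sketched with Python's standard
`sched` module. This is only an illustration under stated assumptions:
`run_periodically` and `extract_headlines` are hypothetical names that do not
belong to any of the tools cited here, and the extraction step is a stub.

```python
import sched
import time


def run_periodically(task, interval_seconds, repetitions):
    """Run `task` `repetitions` times, `interval_seconds` apart, and collect its results."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    results = []

    def step(remaining):
        results.append(task())
        if remaining > 1:
            # Re-arm the timer, as a cron-like scheduler would do.
            scheduler.enter(interval_seconds, 1, step, argument=(remaining - 1,))

    scheduler.enter(0, 1, step, argument=(repetitions,))
    scheduler.run()  # blocks until no scheduled events remain
    return results


def extract_headlines():
    """Hypothetical stand-in for launching a recorded macro or a wrapper."""
    return ["headline placeholder"]


# A 5-minute schedule would use interval_seconds=300; a short interval keeps the demo fast.
snapshots = run_periodically(extract_headlines, interval_seconds=0.01, repetitions=3)
```

A production scheduler would typically run unattended and persist each snapshot
instead of keeping results in memory.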

Data transformation. Information could be wrapped from multiple sources, which
means using different wrappers and also, probably, obtaining different structures
of extracted data. The steps between extraction and delivery are called data
transformation: during these steps, such as data cleaning [Rahm and Do 2000]
and conflict resolution [Monge 2000], the goal is to obtain homogeneous
information under a single resulting structure. The most powerful Web Data
Extraction systems provide tools to perform automatic schema matching across
multiple wrappers [Rahm and Bernstein 2001], and then package the data into a
desired format (e.g., a database, XML, etc.) so that it is possible to query the
data, normalize their structure and de-duplicate tuples.
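
A minimal sketch of these transformation steps follows. It assumes two
hypothetical wrappers, `shop_a` and `shop_b`, whose field names and sample
records are invented for illustration. The field mapping is written by hand
here, whereas the systems discussed above aim to derive it through automatic
schema matching.

```python
# Hypothetical hand-written mappings from each wrapper's fields to one target schema.
FIELD_MAPS = {
    "shop_a": {"title": "name", "price_eur": "price"},
    "shop_b": {"product": "name", "cost": "price"},
}


def to_target_schema(source, record):
    """Rename the fields of `record` according to the mapping of its source wrapper."""
    mapping = FIELD_MAPS[source]
    return {target: record[field] for field, target in mapping.items()}


def clean(record):
    """Data cleaning: collapse whitespace in names and parse prices into floats."""
    name = " ".join(str(record["name"]).split())
    price_text = str(record["price"]).replace("EUR", "").replace(",", ".").strip()
    return {"name": name, "price": float(price_text)}


def deduplicate(records):
    """Keep the first record for each (case-insensitive name, price) pair."""
    seen = set()
    unique = []
    for record in records:
        key = (record["name"].lower(), record["price"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


raw = [
    ("shop_a", {"title": "  USB  cable ", "price_eur": "4,50 EUR"}),
    ("shop_b", {"product": "usb cable", "cost": "4.50"}),
    ("shop_b", {"product": "HDMI cable", "cost": "7.00"}),
]
unified = deduplicate([clean(to_target_schema(source, rec)) for source, rec in raw])
```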

Use of extracted data. When the extraction task is complete and the acquired data
are packaged in the needed format, this information is ready to be used; the last
step is to deliver the package, now represented by structured data, to a managing
system (e.g., a native XML DBMS, an RDBMS, a data warehouse, a CMS, etc.). In
addition to all the specific fields of application covered later in this work, acquired
data can also be used generically for analytical or statistical purposes [Berthold
and Hand 1999], or simply republished in a structured format.

3.1 Layer cake comparisons 

In this section, we summarize the capability stacks of Web data extraction systems 
from our understanding, including aspects of wrapper generation, data extraction 
capabilities, and wrapper usage. In particular, we introduce some specific aspects 
and illustrate the technological evolution of Web Data Extraction systems according 
to each aspect. We use ad hoc diagrams structured as a group of layers (layer 
cakes). In these diagrams, bottom (resp., top) layers correspond to the earliest 
(resp., latest) technological solutions. 

— Wrapper Generation: Ease of Use (Figure 3). The first approaches to consuming
facts from the Web were implemented by means of general purpose languages.
Over time, libraries (e.g., Ruby Mechanize) and special-purpose query languages
(e.g., Jedi [Huck et al. 1998] and Florid [May et al. 1999]) evolved on top of
this principle. Wizards that simplify the specification of queries are the next
logical level and have been used, for instance, in W4F [Sahuguet and Azavant 1999]
and XWrap [Liu et al. 2000]. Advanced Web Data Extraction systems offer GUIs
for configuration, either client-based (e.g., Lapis), Web-based (e.g., Dapper and
Needlebase) or as browser extensions (e.g., iOpus and Chickenfoot). Commercial
frameworks offer a full IDE with the functionalities described in the previous
section (e.g., Denodo, Kapowtech, Lixto and Mozenda).

Fig. 3. Layer Cake: Wrapper Generation: Ease of Use

— Wrapper Generation: Creation Paradigma (Figure 4). From the perspective of
how the system supports the wrapper designer in creating robust extraction
programs, the simplest approach is to manually specify queries and test them against
sample sites individually. Advanced editors offer highlighting of query keywords
and operators and assist query writing with auto-completion and similar usability
enhancements (e.g., Screen-Scraper). In the case of procedural languages, debugging
facilities and visual assistance for constructs such as loops further support
guided wrapper generation. Regarding the definition of deep Web navigations,
such as form fillouts, a number of tools offer VCR-style recording of human
browsing and replaying of the recorded navigation sequence (e.g., Chickenfoot,
iOpus, Lixto). Visual and interactive facilities are offered by systems to simplify
fact extraction. Users mark an example instance, and the system identifies the
selected element in a robust way, and possibly generalizes it to match further
similar elements (e.g., to find all book titles). Such facilities are often equipped with
Machine Learning methods, in which the user can select a multitude of positive
and negative examples, and the system generates a grammar to identify the Web
objects under consideration, often in an iterative fashion (e.g., Wien, Dapper,
Needlebase). Click and drag-and-drop features further simplify the interactive
and visual mechanisms. Finally, some systems offer vertical templates to easily
create wrappers for particular domains, for example for extracting hotel data or
news items, using Natural Language Processing techniques and domain knowledge,
or for extracting data from typical Web layouts, such as table structures
or overview pages with next links.
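
The idea of marking example instances and letting the system generalize to
similar elements can be illustrated with a toy sketch. It is not the method used
by Wien, Dapper or Needlebase: the page, the helper names `steps_to` and
`generalize`, and the rule "keep a positional index only where the examples
agree" are assumptions made for this example, and only positive examples are
handled.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page (ElementTree cannot parse sloppy HTML).
PAGE = """<html><body><div><h2>Books</h2><ul>
<li><span class="title">Dune</span><span class="price">9.99</span></li>
<li><span class="title">Emma</span><span class="price">5.49</span></li>
<li><span class="title">Ulysses</span><span class="price">7.25</span></li>
</ul></div></body></html>"""


def steps_to(root, target):
    """Return the (tag, position) steps leading from `root` down to `target`."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        position = next(i for i, c in enumerate(same_tag, start=1) if c is node)
        steps.append((node.tag, position))
        node = parent
    return steps[::-1]


def generalize(path_a, path_b):
    """Keep a positional index only where both examples agree on it."""
    if len(path_a) != len(path_b) or any(a[0] != b[0] for a, b in zip(path_a, path_b)):
        raise ValueError("the two examples are not structurally similar")
    parts = []
    for (tag, pos_a), (_, pos_b) in zip(path_a, path_b):
        parts.append(f"{tag}[{pos_a}]" if pos_a == pos_b else tag)
    return "./" + "/".join(parts)


root = ET.fromstring(PAGE)
# Simulate the user clicking on the first two book titles.
first_example = root[0][0][1][0][0]
second_example = root[0][0][1][1][0]
expression = generalize(steps_to(root, first_example), steps_to(root, second_example))
matched = [element.text for element in root.findall(expression)]
```

Here the two examples differ only in the position of their `li` ancestor, so
that index is dropped and the resulting expression also matches the third title.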

Fig. 4. Layer Cake: Wrapper Generation: Creation Paradigma

— Deep Web Navigation Capabilities (Figure 5). Before the advent of Web 2.0
techniques, dynamic HTML and AJAX, it was usually sufficient to consider the
Web as a collection of linked pages. In such cases, form filling can be simulated
by tracking the requests and responses from the Web Server, and replaying the
sequence of requests (sometimes populating a session id dynamically, extracted
from a previous page of the sequence). Alternatively, early Web Data Extraction
systems were influenced by screen scraping technologies such as those used for
automating 3270 applications or native applications, which usually rely heavily
on coordinates. Understanding and replaying DOM Events on Web objects is the
next logical level in this capability stack. Advanced systems go a step further,
especially when embedding a full browser: the click on an element is recorded in
a robust way and, during replay, the browser is instructed to perform a visual
click on that element, delegating the DOM handling to the browser and making
sure that the Web page is consumed exactly in the way the human user consumes
it. Orthogonal to such features are capabilities to parametrize deep Web
sequences, and to use query probing techniques to automate deep Web navigation
of unknown forms.

Fig. 5. Layer Cake: Deep Web Navigation Capabilities

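The request-replay approach with a dynamically populated session id can be
sketched as follows. The hidden field name `sid`, the paths and the parameters
are assumptions invented for this example; the sketch only builds the URLs to be
replayed and does not send any request.

```python
import re
from urllib.parse import urlencode

# Assumed convention: the session id travels in a hidden form field named "sid".
SESSION_RE = re.compile(r'name="sid"\s+value="([^"]+)"')


def extract_session_id(html):
    """Pull the session id out of a previously fetched page of the sequence."""
    match = SESSION_RE.search(html)
    if match is None:
        raise ValueError("no session id found in page")
    return match.group(1)


def build_replay_requests(recorded, session_id):
    """Fill the '{sid}' placeholder of each recorded request with the live session id."""
    urls = []
    for path, params in recorded:
        filled = {k: (session_id if v == "{sid}" else v) for k, v in params.items()}
        urls.append(path + "?" + urlencode(filled))
    return urls


first_page = (
    '<form action="/search">'
    '<input type="hidden" name="sid" value="abc123">'
    '<input name="q"></form>'
)
# Recorded navigation sequence, with the session id abstracted into a placeholder.
recorded = [
    ("/search", {"sid": "{sid}", "q": "hotel"}),
    ("/results", {"sid": "{sid}", "page": "2"}),
]
replay_urls = build_replay_requests(recorded, extract_session_id(first_page))
```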


— Web Data Extraction Capabilities (Figure 6). Over time, various approaches to
modeling a Web page have been discussed. The simplest way is to work on
the stream received from the Web server, for example using regular expressions.
In some cases, this is sufficient and even the preferred approach in large-scale
scenarios, because it avoids building a complex and poorly performing model. On
the other hand, in complex Web 2.0 pages or in pages that are not well-formed, it
can be extremely cumbersome to work on the textual level only. Moreover, such
wrappers are not very maintainable and break frequently. The most common
approach is to work on the DOM tree or some other kind of tree structure. This
approach has been followed both by the academic community and by commercial
vendors. In the academic community, studying the expressiveness of languages
over trees and tree automata is of interest, while from a commercial point of view
it is convenient and robust to use languages such as XPath for identifying Web
objects. Usually, not only the elements of a DOM tree are considered, but also
the events, which makes it possible to specify data extraction and navigation
steps with the same approach. In the Web of today, however, very often the DOM
tree does not really capture the essential structure of a Web page as presented to
the human user in a browser. A human perceives something as a table structure,
whereas the DOM tree contains a list of div elements with absolute positioning.
Moreover, binary objects embedded into Web pages, such as Flash, pose further
challenges and are not covered by a tree structure approach. Hence,
screen-scraping has made its way back into novel Web Data Extraction frameworks,
using methods from document understanding and spatial reasoning, such as the
approaches of the TamCrow project [Krüpl-Sypien et al. 2011] and of the ABBA
project [Fayzrakhmanov et al. 2010], spatial XPath extensions [Oro et al. 2010]
and rendition-based extensions in RoadRunner to detect labels [Crescenzi et al.
2004].

Fig. 6. Layer Cake: Web Data Extraction Capabilities

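The contrast between stream-level and tree-level extraction can be shown on a
small invented page. The sketch uses Python's standard library only;
`xml.etree.ElementTree` supports just a subset of XPath and requires well-formed
markup, so real pages would generally need a tolerant HTML parser.

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed sample page.
PAGE = """<html><body>
<table id="books">
  <tr><td class="title">Dune</td><td class="price">9.99</td></tr>
  <tr><td class="title">Emma</td><td class="price">5.49</td></tr>
</table>
</body></html>"""

# Stream level: a regular expression over the raw text. Fast, but it breaks as
# soon as attribute order, quoting or whitespace in the markup changes.
titles_regex = re.findall(r'<td class="title">([^<]*)</td>', PAGE)

# Tree level: an XPath-like expression over the parsed tree, which is insensitive
# to such textual variations in the serialized markup.
root = ET.fromstring(PAGE)
title_cells = root.findall(".//table[@id='books']/tr/td[@class='title']")
titles_tree = [cell.text for cell in title_cells]
```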


— Parser and Browser Embedding (Figure 7). This capability stack is closely related
to the Deep Web capabilities, but focuses on the technical realization of parsing
and browser embedding. Simple approaches create their own parser to identify
relevant HTML tags, while more sophisticated approaches use DOM libraries without
an associated browser view. Because many Web Data Extraction frameworks are
implemented in Java, special-purpose browsers such as the Java Swing browser and
the ICE browser are and have been used. The most powerful approaches are the ones
that embed a standard browser such as IE, Firefox or WebKit-based browsers. In
the case of Java implementations, interfaces such as the Java-XPCOM bridge or
libraries such as JRex are used to embed the Mozilla browser. Embedding a full
browser gives access not only to the DOM model, but also to other models useful
for data extraction, including the CSS Box model. Some tools go in a different
direction: instead of embedding a browser, they are implemented as browser
extensions, which imposes some restrictions and inconveniences. Orthogonal to the
browser stack are the capabilities to
extend extraction functionalities to unstructured text parsing exploiting Natural
Language Processing techniques.

Fig. 7. Layer Cake: Parser and Browser Embedding



— Complexity of Supported Operations (Figure 8). Simple data extraction tools
offer Web notification, e.g., if a particular word is mentioned. Web macro recorders
allow users to create "deep" bookmarks, and Web clipping frameworks clip
fragments of a Web page to the desktop. Personalization of a Web page (for instance,
some tools offer to alter the CSS styles to make frequently used links more
prominent on a page) is a further level in the layer cake. Batch processing frameworks
offer functionalities to replay large extraction tasks (e.g., running through many
different values in a form fillout). Advanced systems go a step further and use
sophisticated extraction plans and anonymization techniques that keep the
extraction under the radar and avoid harming target portals with too many requests
at once, and deliver the data into further applications such as market intelligence
platforms. The expressiveness of a wrapper language also contributes to the
complexity of supported operations.
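
One simple ingredient of an extraction plan that avoids sending too many
requests at once is a per-host throttle. The sketch below is an assumption about
how such a component might look, not the mechanism of any cited system; the
class name `Throttle` and the two-second interval are invented, and the clock and
sleep functions are injectable so that the logic can be exercised without
waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests to the same host."""

    def __init__(self, min_interval_seconds, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, host):
        """Block until a request to `host` is allowed, then record it."""
        now = self._clock()
        last = self._last_request.get(host)
        if last is not None:
            remaining = self._min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request[host] = now


class FakeTime:
    """Deterministic clock used only to demonstrate the throttle."""

    def __init__(self):
        self.now = 0.0
        self.slept = []

    def clock(self):
        return self.now

    def sleep(self, seconds):
        self.slept.append(seconds)
        self.now += seconds


fake = FakeTime()
throttle = Throttle(2.0, clock=fake.clock, sleep=fake.sleep)
throttle.wait("a.example")  # first request to this host: no delay
throttle.wait("a.example")  # same host again: delayed by the full interval
throttle.wait("b.example")  # different host: no delay
```

In real use the default `time.monotonic` and `time.sleep` would be kept and
`wait` would be called right before each request is issued.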

4. APPLICATIONS

In the literature of the Web Data Extraction discipline, many works cover ap- 
proaches and techniques adopted to solve some particular problems related to a 
single or, sometimes, a couple of fields of application. 

The aim of the second part of this paper is to survey and analyze a large number
of applications that are strictly interconnected with Web Data Extraction tasks.
To the best of our knowledge, this is the first attempt to classify applications based
on Web Data Extraction techniques, even if they were originally designed to
operate in specific domains and, in some cases, may appear unrelated.

The spectrum of applications possibly benefiting from Web Data Extraction
techniques is quite large: it ranges from applications designed to work in the
business domain to applications developed in the context of the Social Web. We
classify applications of Web Data Extraction techniques into two main categories:

— Enterprise Applications: applications falling in this category are mainly con- 
ceived with a commercial goal and they often aim at increasing both the level of 
automation of a business process and its efficiency. 

— Social Web Applications: applications falling in this category are mainly designed
to extract and collect data from a Social Web platform (e.g., a Social Networking
Website like Facebook or a resource sharing system like Flickr).

The classification scheme above, however, should not be regarded as rigid: there
are some applications which have been designed to work in the enterprise
context but share some features with applications working on Social Web
platforms. For instance, several firms provide, through their customer care services,
tools like discussion forums that allow a customer to communicate directly with the
firm or to get in touch and share opinions/comments with other customers.
Messages exchanged in a discussion forum can be extracted and analyzed to identify
the trending discussion topics as well as to measure the level of satisfaction
perceived by the customers. This information is valuable because it enables firm
managers to design strategies to increase the quality of services/products delivered
to customers. Of course, applications working on data produced in discussion
forums also have a strong social dimension, because they deal with data generated by
users, and some of this information often derives from social interactions (think of the
volume of information generated when a user poses a question to the community
and other forum members answer it).

In addition to classifying applications into the enterprise and Social Web
categories, we can introduce other criteria, which depend on whether data are
extracted from a single source or from multiple sources, whether they share the same
format and, finally, whether they are associated with the same domain. In
particular, we introduce the following criteria:

— Single Source vs. Multiple Sources. Web Data Extraction techniques can be
applied to data residing on a single platform or, vice versa, they can collect data
located on different platforms.

As an example, in the context of Enterprise applications, some applications fetch
data from a single platform: a relevant example is provided by applications to
manage customer care activities. As previously pointed out, these applications
crawl a corpus of text documents produced within a single platform. By
contrast, in some cases, the application can benefit from data scattered across multiple
systems. In the context of Enterprise applications, a notable example of an
application gathering data from multiple sources is given by Comparison Shopping: in
such a case, a Web Data Extraction technique is able to collect data associated
with a target product from a multitude of e-commerce platforms and to compare,
for instance, the price and the shipping details of that product on each of these
platforms.

The classification above is particularly interesting in the field of Social Web appli- 
cations: in fact we discuss separately applications designed to crawl and collect 
data from a single platform (see Section 4.2.1) from applications running on 
multiple Social Web platforms. 

— Homogeneity in Format vs. Heterogeneity in Format. The second classification
criterion we introduce answers the following question: does the application
collect data of the same format or, vice versa, can it collect data of different
formats? On the basis of this criterion we can classify existing applications
as homogeneous-format or heterogeneous-format.

As for enterprise applications, a good example of homogeneous-format
applications is given by Opinion Sharing Applications. This term identifies the
applications devoted to collecting the opinions a user expresses about a given
service/product. These opinions are typically represented by short textual reviews
or blog posts. In such a case, the type of data extracted by a Web Data Extraction
technique is a string. In some cases, users are allowed to provide scores and
the format of extracted data is a discrete value, generally ranging in an interval
like [0,5]. In the case of Social Web applications, many approaches (like [Gjoka et al.
2010; Catanese et al. 2011; Mislove et al. 2007; Ye et al. 2010]) are devoted to
extracting friendship relationships from Social Networking Websites. In such a
case, extracted data are of the same format, which can be interpreted as a triplet
of the form $(u_x, u_y, f)$, where $u_x$ and $u_y$ are the user identifiers of two
Social Network members; the value $f$ can be a boolean (e.g., it can be true if
$u_x$ and $u_y$ are friends, false otherwise) or a numerical value (e.g., it can be
set equal to the number of messages $u_x$ sent to $u_y$).

Some applications are designed to collect data of different types. A relevant
example in the field of enterprise applications is provided by those applications
designed to manage Business Intelligence tasks. These applications are able to
collect data of different types (like numerical or textual data) as well as
to manage both structured data (e.g., data coming from relational tables of a
DBMS) and unstructured data (e.g., text snippets present in HTML pages).
Relevant examples of applications capable of extracting data of different types are
also present in the context of Social Web applications: for instance, [Kwak et al.
2010] performed in 2009 a crawl of the whole Twitter platform, which produced
textual data (e.g., the tweets and re-tweets produced by users) as well as data
indicating the different types of connections among users ("following", "reply to"
and "mention") [Romero and Kleinberg 2010].
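
The triplet format used for friendship data can be written down directly as a
small record type. The sketch below is only one possible representation; the type
name `Tie` and the sample users are invented for illustration.

```python
from typing import NamedTuple, Union


class Tie(NamedTuple):
    """One extracted relationship (u_x, u_y, f) between two Social Network members."""
    u_x: str
    u_y: str
    f: Union[bool, int]  # friendship flag, or number of messages u_x sent to u_y


# Boolean interpretation: f is True when the two users are friends.
friendships = [
    Tie("alice", "bob", True),
    Tie("alice", "carol", False),
]

# Numerical interpretation: f counts the messages u_x sent to u_y.
message_counts = [
    Tie("alice", "bob", 12),
    Tie("bob", "alice", 3),
]

friends_of_alice = [t.u_y for t in friendships if t.u_x == "alice" and t.f is True]
messages_sent_by_alice = sum(t.f for t in message_counts if t.u_x == "alice")
```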

— Single Purpose or Multiple Purpose. In some cases, the goal of a Web Data
Extraction tool is to extract data describing a specific facet of the
same social phenomenon or business process. In such a case, an application is
catalogued as single purpose. Other applications, instead, aim at collecting data
of different nature which, if properly linked, are relevant to better understand and
interpret a particular phenomenon. After linking data, novel and sophisticated
applications running on them can be defined. Applications belonging to this
category will be called multi purpose.

As an example of single purpose applications, we focus on applications devoted
to collecting bibliographic data as well as citations among papers. To collect
bibliographical information, Web Data Extraction techniques are required to query
several databases containing scientific publications (like ISI Web of Science,
SCOPUS, PubMed and so on); the goal of the application, however, is to collect all
the citations associated with a particular paper or with a particular author, because
this information is useful to assess the impact of an author or a scientific
publication in a scientific discipline. In the field of Social Web applications, an example
of a single purpose application is given by applications devoted to collecting data
about human activities in different Social Web platforms: for instance, the tags
contributed by a user in different systems [Szomszor et al. 2008]. This is relevant
to understand whether the language of a user is uniform across various platforms or
whether the platform features affect how the user forms her vocabulary.

We can cite some relevant examples of multi purpose applications both at the
enterprise and at the Social Web level. For instance, in the class of enterprise
applications, some applications are able to collect data produced by different Web
Services and combine these data to produce more advanced applications. For
instance, in the travel industry, one can think of applications collecting data on
flights and hotels and combining these data to produce holiday packages. In the field
of Social Web applications, several authors observed that merging different
types of data describing different types of human activities produces a more detailed
knowledge of human needs and preferences. For instance, in [Schifanella et al.
2010] the authors conducted extensive experiments on data samples extracted
from Last.Fm and Flickr. They showed that strong correlations exist between
user social activities (e.g., the number of friends of a user or the number of groups
she joined) and the tagging activity of the same user. A notable result provided in
[Schifanella et al. 2010] was that user contributed tags are a useful indicator
to predict friendship relationships. Other studies, relating the geographical
location of a user to the content she posted (e.g., photos), are provided in
[Crandall et al. 2009] and [Kinsella et al. 2011].

4.1 Enterprise Applications 

In this section we describe the main features of software applications and procedures
related to Web Data Extraction with a direct, subsequent or final commercial
purpose.

4.1.1 Context-aware advertising. Context-aware advertising techniques aim at
presenting to the end user of a Web site thematized commercial advertisements
together with the content of the Web page the user is reading. The ultimate goal
is to increase the value that a Web page can have for its visitors and, ultimately,
to raise the level of interest in the ad.

First efforts to implement context-aware advertising were made by Applied
Semantics, Inc. 8; subsequently Google bought their "AdSense" advertising solution.

The implementation of context-aware advertisements requires analyzing the
semantic content of the page, extracting relevant information, both in the structure and
in the data, and then contextualizing the content and placement of the ads in the same
page.

Contextual advertising, compared to the old concept of Web advertising,
represents an intelligent approach to providing useful information to the user, who is
statistically more interested in thematized ads, and a better source of income for
advertisers.

4.1.2 Customer care. Medium- and large-sized companies that provide customer
support usually handle a large amount of unstructured information available as text
documents. Relevant examples are emails, support forum discussions, documentation,
shipment address information, credit card transfer reports, phone conversation
transcripts, and so on. The ability to analyze these documents and extract the main
concepts associated with them provides several concrete advantages. First of all,
documents can be classified more effectively, which makes their retrieval
easier. In addition, once the concepts present in a collection of documents have been
extracted, it is possible to identify relevant associations between documents on the
basis of the concepts they share. Ultimately, this makes it possible to perform
sophisticated data analysis targeted at discovering trends or, in the case of forms,
hidden associations among the products/services offered by a brand.

In this scenario, Web Data Extraction techniques play a key role because they 
are required to quickly process large collections of textual documents and derive 
the information located in these documents. The retrieved data are finally pro- 
cessed by means of algorithms generally coming from the area of Natural Language 
Processing. 

4.1.3 Database building. In the Web marketing sector, Web Data Extraction
techniques can be employed to gather data referring to a given domain. These
data may have a twofold use: (i) a designer, through reverse engineering analysis,
can design and implement a DBMS representing those data; (ii) the DBMS can
be automatically populated by using data provided by the Web Data Extraction
system. The activity of designing and building a DBMS starting from collected
data will be called Database Building.
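
Both uses can be sketched with Python's built-in `sqlite3` module: a table
schema is derived from the extracted records and the table is then populated with
them. The function names, the type-inference rule and the sample listings are
assumptions made for this example; table and column names are interpolated into
the SQL text, so in this sketch they are assumed to come from a trusted wrapper
definition.

```python
import sqlite3


def infer_sql_type(values):
    """Pick a column type from the Python types observed in the extracted values."""
    if all(isinstance(v, int) for v in values):
        return "INTEGER"
    if all(isinstance(v, (int, float)) for v in values):
        return "REAL"
    return "TEXT"


def build_database(conn, table, records):
    """Create `table` from the structure of `records`, then insert all of them."""
    columns = list(records[0].keys())
    column_defs = ", ".join(
        f'"{c}" {infer_sql_type([r[c] for r in records])}' for c in columns
    )
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r[c] for c in columns) for r in records],
    )
    conn.commit()


# Invented records, as a wrapper for real estate listings might deliver them.
listings = [
    {"city": "Vienna", "rooms": 3, "price": 250000.0},
    {"city": "Messina", "rooms": 2, "price": 120000.0},
]
conn = sqlite3.connect(":memory:")
build_database(conn, "listings", listings)
rows = conn.execute("SELECT city, rooms FROM listings ORDER BY price").fetchall()
```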

Fields of application are countless: financial companies could be interested in 
extracting financial data from the Web and storing them in their DBMSs. Extrac- 
tion tasks are often scheduled in such a way as to be executed automatically and 
periodically. 

The real estate market is also flourishing: acquiring data from multiple Web
sources is an important task for a real estate company, for comparison, pricing,
co-offering, etc.

Companies selling products or services probably want to compare their pricing
with that of their competitors: the extraction of product pricing data is an
interesting application of Web Data Extraction systems. Finally, we can list other
related tasks that involve Web Data Extraction: duplicating an on-line database,
extracting information from dating sites, capturing auction information and prices
from on-line auction sites, acquiring job postings from job sites, comparing betting
information and prices, etc.

4.1.4 Software Engineering. Extracting data from Web sites has also become
interesting for Software Engineering: for instance, Rich Internet Applications (RIAs) are
rapidly emerging as one of the most innovative and advanced kinds of applications
on the Web. RIAs are Web applications featuring a high degree of interaction and
usability, inherited from their similarity to desktop applications. Amalfitano et al.
[Amalfitano et al. 2008] have developed a reverse engineering approach to abstract
Finite State Machines representing the client-side behavior offered by RIAs.

4.1.5 Business Intelligence and Competitive Intelligence. Baumgartner et al.
[Baumgartner et al. 2005; Baumgartner et al. 2009; Baumgartner et al. 2010] analyzed
in depth how to apply Web Data Extraction techniques and tools to improve the
process of acquiring market information. A solid layer of knowledge is fundamental
to optimize decision-making activities, and a large amount of public information
can be retrieved on the Web. They illustrate how to acquire this unstructured
and semi-structured information. In particular, using the Lixto Suite to access,
extract, clean and deliver data, it is possible to gather, transform and obtain
information useful for business purposes. It is also possible to integrate these data with
other common platforms for Business Intelligence, like SAP 9 or Microsoft Analysis
Services [Melomed et al. 2006].

More broadly, the process of gathering and analyzing information about products,
customers and competitors with the goal of helping the managers of a firm in
decisional processes is commonly called Competitive Intelligence, and is strictly
related to data mining [Han and Kamber 2000]. Zanasi [Zanasi 1998] was the first to
introduce the possibility of acquiring these data, through data mining processes,
from public domain information. Chen et al. [Chen et al. 2002] developed a platform
that works more like a spider than a Web Data Extraction system, and which
represents a useful tool to support Competitive Intelligence operations.

In Business Intelligence scenarios, we ask Web Data Extraction techniques to 
satisfy two main requirements: scalability and efficient planning strategies because 
we need to extract as much data as possible with the smallest amount of resources 
in time and space. 

4.1.6 Web process integration and channel management. In the Web of today
data is often available via APIs (e.g., refer to Programmable Web 10). Nevertheless,
the larger amount of data is primarily available in semi-structured formats such as
HTML. To use Web data in Enterprise Applications and service-oriented
architectures, it is essential to provide means for automatically turning Web Applications
and Web sites into Web Services, allowing a structured and unified access to
heterogeneous sources. This includes understanding the logic of the Web application,
filling out form values, and grabbing relevant data.

In a number of business areas, Web applications are predominant among business
partners for communication and business processes. Various types of processes are
carried out on Web portals, covering activities such as purchase, sales, or quality
management, by manually interacting with Web sites. Typical vertical examples
where Web Data Extraction proves useful include channel management in the travel
industry (like automating the regular offerings of rooms on hotel portals with
bi-directional Web connectors), re-packaging complex Web transactions to Web
services and consequently to other devices, as well as automating the communication of
automotive suppliers with automotive companies.

9 http://www.sap.com

10 http://www.programmableweb.com/

Tools for wrapper generation pave the way for Web Process Integration and
enable the Web of Services, i.e., the seamless integration of Web applications into
a corporate infrastructure or service oriented landscape by generating Web services
from given Web sites [Baumgartner et al. 2010]. Web process integration can be
understood as front-end and "outside-in" integration: it integrates cooperative and
non-cooperative sources without the need for information providers to change their
backend. Additional requirements in such scenarios include support for a large
number of users, for real-time parametrized Web queries and for complex
Web transactions.

4.1.7 Functional Web application testing. Testing and Quality Management are 
essential parts of the life-cycle of software. Facets of testing are manifold, including 
functional tests, stress/load tests, integration tests, and testing against specifica- 
tions, to name a few. Usually, the strategy is to automate a large percentage of 
functional tests and execute test runs as part of nightly builds as regression tests. 
Such tests occur at various levels, for example testing the system functionality as 
a blackbox via APIs, or testing the system at the GUI level simulating either the 
user's steps or creating a model of possible application states. 

In today's world of Software-as-a-Service platforms and Web oriented
architectures, Web application testing plays an important role. One aspect is simulating
the user's path through the application logic. Robust identification criteria can be
created by taking advantage of the tree structure or visual structure of the page.
Typical actions in such test scripts include setting and getting values of form fields,
picking dates, defining checkpoints to compare values, and following different
branches depending on the given page. Due to automation, every step can be
parametrized and a test script can be executed in variations.

The requirements for tools in the area of Web application testing are to deal well
with AJAX/dynamic HTML, to create robust test scripts, to efficiently maintain
test scripts, and to execute test runs and create meaningful reports; moreover,
unlike in other application areas, the support of multiple state-of-the-art browsers
in various versions is an absolute must. One widely used open source tool for Web
application testing is Selenium 11.

4.1.8 Comparison shopping. One of the most appreciated services in the e-commerce 
area is comparison shopping, i.e., the capability of comparing products or services; 
various types of comparison are allowed, ranging from simple price comparison 
to feature comparison, technical sheet comparison, user experience comparison, 
etc. 



11 http://seleniumhq.org/ 

These services heavily rely on Web Data Extraction, using Web sites as sources 
for data mining and a custom internal engine to make the comparison of 
similar items possible. 

Many Web stores today also offer personalization forms that make extraction 
tasks more difficult: for this reason, many last-generation commercial Web Data 
Extraction systems (e.g., Lixto, Kapow Mashup Server, UnitMiner 12, Bget 13) provide 
support for deep navigation and dynamic content pages. 

4.1.9 Mashup scenarios. Today, leading software vendors provide mashup 
platforms (such as Yahoo! Pipes or Lotus Mashups) and establish mashup 
communication standards such as EMML 14. A mashup is a Web site or a Web application 
that combines a number of Web sites into an integrated view. Usually, the content 
is taken via APIs, embedding RSS or Atom feeds in a REST-like way. With wrapper 
technology, one can turn legacy Web applications into light-weight APIs such 
as REST that can be integrated in mashups in the same fashion. Web Mashup 
Solutions no longer need to rely on APIs offered by the providers of sites, but can 
extend their scope to the whole Web. In particular, the deep Web becomes accessible by 
encapsulating complex form queries and application logic steps into the methods of 
a Web Service. End users are put in charge of creating their own views of the Web 
and embedding data into other applications ("consumers as producers"), usually in a 
light-weight way. This results in "situational applications", possibly unreliable and 
insecure, that nevertheless help to solve an urgent problem immediately. 
In Mashup scenarios, one important requirement of Web Data Extraction tools is 
ease of use for non-technical content managers, giving them the possibility to 
create new Web connectors without the help of IT experts. 

4.1.10 Opinion mining. Related to comparison shopping, opinion sharing 
represents its evolution: users want to express opinions on products, experiences, 
services they enjoyed, etc. The most common form of opinion sharing is represented 
by blogs, containing articles, reviews, comments, tags, polls, charts, etc. All this 
information usually lacks structure, so its extraction is a hard problem, even for 
current systems, because of the billions of Web sources now available. Sometimes 
model-based tools fit well, taking advantage of common templates (e.g., 
Wordpress 15, Blogger 16, etc.); other times Natural Language Processing techniques fit 
better. Dave et al. [Dave et al. 2003] approached the problem of opinion 
extraction and subsequent semantic classification of reviews of products. 

Another form of opinion sharing in semi-structured platforms is represented by 
Web portals that let users write unmoderated opinions on various topics. 



12 http://www.qualityunit.com/unitminer/ 
13 http://www.bget.com/ 
14 http://www.openmashup.org 
15 http://wordpress.org/ 
16 https://www.blogger.com/ 

4.1.11 Citation databases. Building citation databases is an intensive field of 
application for Web Data Extraction: CiteSeer 17, Google Scholar 18 and DBLP 19, amongst 
others, are brilliant examples of applying Web Data Extraction to 
the problem of collecting digital publications, extracting relevant data, i.e., 
references and citations, and building structured databases, where users can perform 
searches, comparisons, citation counts, cross-referencing, etc. 

Several challenges are related to this context of application: for example, the 
corpus of scientific publications can vary rapidly over time and, to keep the database 
up to date, it could be necessary to repeatedly apply a Web Data Extraction process 
from scratch on the same Web sources. Such an operation, however, can be 
excessively time-consuming. An attempt to address this challenge was made in 
[Chen et al. 2008]: in that paper, the authors suggest an incremental solution which 
requires identifying the portions of information shared by consecutive snapshots and 
reusing the information extracted from one snapshot for the subsequent one. 
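
The general idea of reusing work across snapshots can be sketched as follows. This is not the algorithm of [Chen et al. 2008]; it is a minimal illustration that assumes a snapshot is already split into text fragments and that an unchanged fragment can be recognized by a hash of its text. The function names, the toy extractor and the toy citation strings are all invented for this example.

```python
import hashlib

def fingerprint(fragment):
    """Stable fingerprint of a page fragment (here: a hash of its raw text)."""
    return hashlib.sha256(fragment.encode("utf-8")).hexdigest()

def extract_incrementally(fragments, extract, cache):
    """Run `extract` only on fragments not seen in a previous snapshot.

    `cache` maps fingerprints to previously extracted records and is
    updated in place, so it can be passed on to the next snapshot.
    Returns the extracted records and the number of fresh extractions.
    """
    records, fresh = [], 0
    for fragment in fragments:
        key = fingerprint(fragment)
        if key not in cache:
            cache[key] = extract(fragment)  # the expensive step
            fresh += 1
        records.append(cache[key])
    return records, fresh

def parse_citation(text):
    """Toy extractor for strings of the form 'Author. Title. Year'."""
    author, title, year = [part.strip() for part in text.split(".")[:3]]
    return {"author": author, "title": title, "year": int(year)}

# Two consecutive toy snapshots; the second one adds a single new entry.
snapshot_1 = ["Rossi. First toy title. 2001", "Bianchi. Second toy title. 2005"]
snapshot_2 = snapshot_1 + ["Verdi. Third toy title. 2008"]

cache = {}
_, fresh_1 = extract_incrementally(snapshot_1, parse_citation, cache)
records, fresh_2 = extract_incrementally(snapshot_2, parse_citation, cache)
print(fresh_1, fresh_2)  # fresh extractions per snapshot
```

In this sketch the second snapshot triggers only one fresh extraction, because the two fragments shared with the first snapshot are served from the cache.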

4.1.12 Web accessibility. Techniques for automatic data extraction and 
document understanding are extremely helpful in making Web pages more accessible to 
blind and partially sighted users. 

Today's approaches are insufficient to overcome the problem. The first 
approach, screen-reader usage, is optimized for native client interfaces, and not 
well equipped to deal with the presentation, content and interactions in Web 2.0, 
such as understanding the reading order, telling the user what a date picker means, 
or jumping from one forum post to the next. The second approach, the Web 
Accessibility Initiative, is no doubt absolutely necessary and has defined valuable 
concepts, such as ARIA roles assignable to GUI elements. However, due to the 
additional investment required, such guidelines are at best applied in governmental sites. 

Approaches such as ABBA [Fayzrakhmanov et al. 2010] overcome these limita- 
tions. In the ABBA approach, a Web page is transformed into a formal multi-axial 
semantic model; the different axes offer means to reason on and serialize the doc- 
ument by topological, layout, functional, content, genre and saliency properties. 
A blind person can navigate along and jump between these axes to skip to the 
relevant parts of a page. E.g., the presentational axis contains transformed visual 
cues, allowing the user to list information in the order of visual saliency. 

4.1.13 Main content extraction. Typical Web pages, for example news articles, 
contain, in addition to the main content, navigation menus, advertisements and 
templates. In some cases, such as when archiving a news article Web page for 
later offline reading, it is convenient to get rid of such irrelevant fragments. To 
extract the main content only, one needs to apply techniques to distinguish the 
relevant content from the irrelevant one. Approaches range from complex visual 
Web page analysis to approaches relying on text density or link density analysis. 
An approach for boilerplate detection using shallow text features was introduced 
in [Kohlschütter et al. 2010]. Additionally, tools/apps such as Instapaper or the 



17 http://citeseer.ist.psu.edu/ 
18 http://scholar.google.com/ 

Readability Library 20 use main content extraction to store the relevant fragment 
and text from a Web page that resembles the article for later reading. 
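
As an illustration of the link density idea mentioned above, the sketch below splits a page into text blocks and keeps those that are long enough and contain few characters inside links. It is a simple threshold heuristic written for this example, not the classifier of [Kohlschütter et al. 2010]; the set of block tags, the two thresholds and the sample page are assumptions.

```python
from html.parser import HTMLParser

# Assumed set of tags that delimit text blocks.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "ul", "nav"}

class BlockCollector(HTMLParser):
    """Split a page into text blocks and count the characters inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # list of (text, link_chars)
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

def main_content(html, max_link_density=0.3, min_chars=40):
    """Keep blocks that are long enough and mostly made of non-link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    kept = []
    for text, link_chars in collector.blocks:
        density = link_chars / len(text)
        if density <= max_link_density and len(text) >= min_chars:
            kept.append(text)
    return kept

SAMPLE = """
<div id="menu"><a href="/">Home</a> <a href="/news">News</a></div>
<p>The city council approved the new budget on Tuesday after a long debate.</p>
<p><a href="/ads">Buy now</a></p>
"""
print(main_content(SAMPLE))
```

On the sample page, the menu and the advertisement consist almost entirely of link text and are discarded, while the article paragraph is kept.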

4.1.14 Web (experience) archiving. Digital preservation and Web data curation 
are the goals of the discipline of Web Archiving. On the one hand, this means 
accessing information no longer available in the live Web, and on the other hand 
reflecting how the Web was used in former times. Events such as iPres and IWAW, 
and consortia such as Netpreserve, as well as various local Web archiving initiatives, 
tackle this task. There are numerous challenges [Masanes 2006] due to the fact 
that Web pages are ephemeral and subject to unpredictable additions, deletions and 
modifications. Moreover, the hidden Web poses a further challenge. Approaches to 
Web Archiving are manifold, and range from Web Crawling/Harvesting, server-side 
archiving, transaction-based archiving and archiving the content of Web databases to 
library-like approaches advocating persistent identifiers (e.g., the Digital Object 
Identifier, DOI). Especially in the use case of Web database content archiving, Web 
Data Extraction techniques are exploited. 

Another possibility for archiving the Web is to archive how Web pages are 
consumed. We can consider this as a type of event-based archiving, which makes sense 
especially for rich Web applications. The idea is not to archive everything, but to 
selectively archive sample paths through an application. This requires choosing 
sample sites, gaining an understanding of common and frequent paths through an 
application, and storing the interaction sequences. In a museum-like approach, such 
selected sequences are stored and can be restored or replayed to provide the user 
with the experience of how the Web was consumed. Deep Web navigation techniques 
and form understanding are the key technologies here. 

4.1.15 Summary. Figure 9 summarizes the 14 enterprise application 
scenarios discussed. For each scenario we describe the main ingredients of the value chain, as 
illustrated in the scenario descriptions. 

4.2 Social Web Applications 

In recent years, Social Web platforms have emerged as one of the most relevant 
phenomena on the Web: these platforms are built around users, letting them 
create a web of links between people and share thoughts, opinions, photos, travel tips, 
etc. In such a scenario, often called Web 2.0, users turn from passive consumers of 
content into active producers. 

Social Web platforms provide novel and unprecedented research opportunities. 
In fact, the analysis of patterns of user interactions on a large, often planetary, 
scale provides a concrete opportunity to answer questions like these: how are 
human relationships (e.g., friendship relationships) created, and how do they evolve over time 
[Kleinberg 2008]? How do novel ideas spread and propagate through the web of 
human contacts [Bettencourt et al. 2006]? How does human language evolve 
through social interactions (e.g., how do people expand their lexicon on the basis 
of their interactions with other people) [Mathes 2004]? 

Besides scientific questions, the analysis of patterns of human interactions in 
Social Web platforms also has relevant implications at the business level: if we are 



20 The reader is referred to the PHP port http://www.keyvan.net/2010/08/php-readability/ 

Fig. 9. Summary of the application domains in the enterprise context. 

able to understand the dynamics of interactions among humans, we are also able 
to identify how groups of users aggregate themselves around shared interests. This 
is a crucial step for marketing purposes: once users have been grouped, we can, for 
instance, selectively disseminate commercial advertisements only to those groups 
formed by users who are actually interested in receiving those advertisements. In 
an analogous fashion, the fabric of social interactions can be used to identify influ- 
ential users, i.e., those users whose commercial behaviors are able to stimulate the 
adoption/rejection of a given product by large masses of users. 

Finally, Social Web users often create accounts and/or profiles in multiple 
platforms [Szomszor et al. 2008; De Meo et al. 2009]. Correlating these accounts and 
profiles is a key step to understand how the design features and the architecture 
of a Social Web platform affect the behavior of a user: so, for instance, one 
may ask whether some functionalities provided by a given platform increase the 
propensity of users to socialize or affect the volume of content produced 
by a user. Once the relationship between the features of a given platform and the 
behavior of a user has been elucidated, the designers/managers of that platform can 
provide novel services to raise the level of engagement of a user in the platform or 
to raise their degree of loyalty (e.g., to prevent users from becoming inactive in the 
platform and migrating to other ones). 

In the context above, Web Data Extraction techniques play a key role because 
the capability of gathering large amounts of data in a timely fashion from one or more 
Social Web platforms is an indispensable tool to analyze human activities. Traditional 
Web Data Extraction techniques are challenged by new and hard problems both at 
the technical and scientific level. First of all, Social Web platforms feature a high 
level of dynamism and variability because the web of contacts of a user or a group 
of users may vary significantly within short time spans; therefore, we need to design Web 
Data Extraction algorithms/procedures capable of gathering large amounts of data 
quickly, so that the fragments of collected data keep pace with 
changes occurring in the structure of the user social network. If such a requirement 
is not satisfied, the picture emerging from the analysis we can carry out on the 
data at our disposal could be wrong and would fail to capture the structure and 
evolution of human interactions in the platform. A second challenge depends on 
the fact that Web Data Extraction algorithms are able to capture only a portion 
of the data generated within one or more platforms. Therefore, we are required to 
check that the features of a data sample generated as the output of a Web Data 
Extraction algorithm replicate fairly well the structure of the original data within 
the platform(s). Finally, since the data to be gathered are associated with humans or reflect 
human activities, Web Data Extraction techniques are challenged to provide solid 
guarantees that user privacy is not violated. 

In the remainder of this section we address these challenges and 
illustrate how Web Data Extraction techniques have been applied to collect data 
from Social Web platforms. First of all, we describe how to gather data about 
user relationships and activities within a single social platform (see Section 4.2.1). 
Secondly, we consider users who have created and maintain accounts in multiple Social 
Web platforms and discuss issues and challenges related to data collection in this 
scenario (see Section 4.2.2). 

4.2.1 Extracting data from a single Online Social Web platform. In this section 
we first discuss the technological challenges that arise when collecting data about 
social relationships, user activities and the resources produced and shared by users. 
Subsequently, we discuss privacy risks associated with the consumption/usage of 
human-related data. 

Technical challenges for collecting data from a single Social Web platform. We 
can classify techniques to collect data from a Social Web platform into two main 
categories: the former relies on the usage of ad hoc APIs, usually provided 
by the Social Web platform itself; the latter relies on HTML scraping. 

As for the first category of approaches, we point out that, today, Social Web 
platforms provide powerful APIs (often available in multiple programming languages) 
that make it quick and easy to retrieve a wide range of information from 
the platform itself. This information, in particular, covers not only social 
connections involving members of the platforms but also the content the users posted and, 
for instance, the tags they applied to label available content. 

We can cite the approach of [Kwak et al. 2010] as a relevant example of how 
to collect data from a Social Web platform by means of an API. In that paper, 
the authors present the results of the crawling of the whole Twitter platform. The 
dataset described in [Kwak et al. 2010] consisted of 41.7 million user profiles 
and 1.47 billion social relations; in addition to collecting information about user 
relationships, the authors also gathered information on tweets and, by performing 
a semantic analysis, on the main topics discussed in these tweets. The final 
dataset contained 4,262 trending topics and 106 million tweets. 

From a technical standpoint we want to observe that: (i) The Twitter API 
allows access to the whole social graph, i.e., the graph representing users and their 
connections in Twitter, without authentication. Other Social Web platforms and 
the APIs they offer, however, do not generally allow access to the whole social 
graph. A meaningful example is given by the Facebook API. (ii) The Twitter API, 
by default, allows a human user or a software agent to send only 150 requests per 
hour: this could be an unacceptable limitation because the amount of information 
generated within Twitter in a relatively short time span can be very large, and 
changes in the Twitter network topology might therefore not be properly detected. To 
overcome this problem, Twitter offers white lists: users registered in white lists can 
send up to 20,000 requests per IP per hour. In [Kwak et al. 2010] the authors 
used a group of 20 computers, each of them belonging to the Twitter white lists, 
to perform real-time monitoring of Twitter. 
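
A back-of-the-envelope computation makes the impact of these rate limits concrete. The figures are those quoted above; the assumption of exactly one request per user profile is ours and is made only for illustration, since a real crawl needs a different number of requests per profile.

```python
PROFILES = 41_700_000      # user profiles reported in [Kwak et al. 2010]
DEFAULT_RATE = 150         # requests per hour under the default limit
WHITELIST_RATE = 20_000    # requests per hour per white-listed IP
MACHINES = 20              # white-listed computers used by the authors

# Assumption for illustration only: one request per profile.
hours_default = PROFILES / DEFAULT_RATE
hours_whitelisted = PROFILES / (WHITELIST_RATE * MACHINES)

print(f"default limit: {hours_default:,.0f} hours "
      f"(about {hours_default / 24 / 365:.1f} years)")
print(f"20 white-listed machines: {hours_whitelisted:.2f} hours "
      f"(about {hours_whitelisted / 24:.1f} days)")
```

Under this assumption, a single client at the default limit would need 278,000 hours (roughly 31.7 years), whereas 20 white-listed machines would need 104.25 hours (roughly 4.3 days).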

Approaches based on the scraping of HTML pages are able to overcome the 
limitations above even if they are more complicated to design and implement. To 
the best of our knowledge, one of the first attempts to crawl large Online Social 
Networks was performed by Mislove et al. [Mislove et al. 2007]. In that paper, 
the authors focused on platforms like Orkut, Flickr and LiveJournal. To perform 
crawling, the approach of [Mislove et al. 2007] suggests iteratively retrieving the 
list of friends of a user who have not yet been visited and adding these contacts to 
the list of users to visit. 

In the language of graph theory, this corresponds to performing a 
Breadth-First-Search (BFS) visit of the Social Network graph. The user account 
from which the BFS starts is often called the seed node; the BFS ends when the whole 
graph is visited or, alternatively, a stopping criterion is met. The BFS is easy to 
implement and efficient; it produces accurate results if applied on social graphs 
which can be modeled as unweighted graphs. For these reasons, it has been 
applied in a large number of studies about the topology and structure of Online 
Social Networks (see, for instance, [Chau et al. 2007; Wilson et al. 2009; Gjoka 
et al. 2010; Ye et al. 2010; Catanese et al. 2010]). 
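
The BFS visit described above can be sketched as follows. The function get_friends is a placeholder for whatever operation returns the friend list of a user (an API request or a scraped page), the stopping criterion is assumed to be a maximum number of visited users, and the toy graph is invented for this example.

```python
from collections import deque

def bfs_crawl(seed, get_friends, max_users=None):
    """Breadth-first visit of a social graph starting from the seed node.

    The crawl stops when no unvisited user is left in the frontier or
    when `max_users` users have been collected (the stopping criterion).
    Returns the users in the order in which they were discovered.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for friend in get_friends(user):
            if friend in seen:
                continue                      # already scheduled or visited
            if max_users is not None and len(seen) >= max_users:
                continue                      # stopping criterion reached
            seen.add(friend)
            visited.append(friend)
            queue.append(friend)
    return visited

# Toy friendship graph standing in for a real platform.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}

print(bfs_crawl("a", GRAPH.__getitem__))               # whole graph
print(bfs_crawl("a", GRAPH.__getitem__, max_users=3))  # early stop
```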

As observed by [Mislove et al. 2007], BFS may suffer from serious limitations. First 
of all, a crawler can get trapped in a strongly connected component of the social 
graph. In addition, if we use the BFS sample to estimate some structural 
properties of the social network graph, some properties could be overestimated 
while others could be underestimated [Kurant et al. 2010]. 

To alleviate these problems, several authors suggested more refined sampling 
techniques. Implementing these techniques is equivalent to defining new 
Web Data Extraction procedures. Most of these techniques have been discussed 
and exploited in the context of Facebook [Gjoka et al. 2010; Catanese et al. 2010; 
Catanese et al. 2011] but, unfortunately, some of them cannot be extended to other 
platforms. 

In particular, Gjoka et al. [Gjoka et al. 2010] considered different visiting 
algorithms, like BFS, "Random Walks" and "Metropolis-Hastings Random Walks". 
A particular mention goes to a visiting method called rejection sampling. This 
technique relies on the fact that a truly uniform sample of Facebook users can be 
obtained by generating uniformly at random a 32-bit user ID and, subsequently, 
by polling Facebook about the existence of that ID. The correctness of this 
procedure derives from the fact that each Facebook user was uniquely identified by a 
numerical ID ranging from 0 to 2^32 - 1. Of course, such a solution works well for 
Facebook but it might not work for other platforms. 
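
A minimal sketch of rejection sampling over a numerical ID space is given below. The predicate exists stands for the operation of polling the platform about an ID; here it is simulated by membership in a toy set of registered IDs, and a 16-bit ID space is used so that the example runs quickly. The function name and the cap on the number of trials are assumptions of this sketch.

```python
import random

def uniform_user_sample(exists, sample_size, id_bits=32, rng=None,
                        max_trials=1_000_000):
    """Draw IDs uniformly at random and keep only those that exist.

    Every ID in [0, 2**id_bits - 1] is equally likely to be proposed, so
    every existing user is equally likely to be accepted. IDs are drawn
    with replacement, hence the sample may contain duplicates.
    """
    rng = rng or random.Random()
    sample, trials = [], 0
    while len(sample) < sample_size and trials < max_trials:
        trials += 1
        candidate = rng.getrandbits(id_bits)
        if exists(candidate):          # stands for polling the platform
            sample.append(candidate)
    return sample, trials

# Toy platform: one ID out of seven is registered in a 16-bit ID space.
registered = set(range(0, 2**16, 7))
sample, trials = uniform_user_sample(
    registered.__contains__, 50, id_bits=16, rng=random.Random(0)
)
print(len(sample), "users accepted after", trials, "trials")
```

The number of rejected proposals grows as the ID space becomes sparser, which is one reason why such a procedure may be impractical on platforms with a large and mostly unassigned ID space.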

In [Catanese et al. 2010; Catanese et al. 2011], the authors designed a Web Data 
Extraction architecture based on Intelligent Agents, which is reproduced in Figure 
10. Such an architecture consists of three main components: (i) a server running 
the mining agent(s); (ii) a cross-platform Java application, which implements the 
logic of the agent; (iii) an Apache interface, which manages the information transfer 
through the Web. The proposed architecture is able to implement several crawling 
strategies like BFS or rejection sampling. 

The sampling procedure in [Catanese et al. 2010; Catanese et al. 2011] works as 
follows: an agent is activated and it queries the Facebook server(s) to obtain the 
list of Web pages representing the list of friends of a Facebook user. Of course, the 
Facebook account to visit depends on the crawling algorithm we want 
to implement. After parsing this list of pages, it is possible to reconstruct a portion 
of the Facebook network. Collected data can be converted to XML format so 
that they can be exploited by other applications (e.g., network visualization 
tools). 

Fig. 10. Architecture of the data extraction platform proposed in [Catanese et al. 2011] 



Unfortunately, some technical constraints limit the effectiveness of the process 
described above. In particular, Facebook provides shortened friend lists not 
exceeding 400 Facebook members to reduce traffic through its network. To work around 
these limitations, other techniques like visual data extraction techniques have to be 
considered. 

In addition to gathering data about social relationships, we may collect content 
generated by users. This content may vary from resources posted by users (like 
photos in Flickr or videos in YouTube) to tags applied to label resources with 
the goal of making these resources easier to retrieve or of increasing the visibility 
of the content users generated. 

Extracting content from a Social Web platform does not pose significant technical 
problems; on the contrary, it should be faster and easier than in other fields: HTML-aware 
and model-based extraction systems should fit very well with the semi-structured 



ACM Computing Surveys, Vol. V, No. N, July 2012. 



Web Data Extraction, Applications and Techniques: A Survey • 41 

templates used by most common Social Networking services. Sometimes data are 
distributed in structured formats like RSS/Atom, so acquiring this information 
is easier than with traditional HTML sources. 

Privacy pitfalls in collecting user-related data. The major concern about the 
extraction of data from Social Web platforms is privacy. Several researchers, 
in fact, showed that we can disclose private information about users by leveraging 
publicly available information in a Social Web platform. For instance, 
[Hecht et al. 2011] provided an approach to finding user location on the basis of user 
tweets. The authors used basic machine learning techniques which combined user 
tweets with geotagged articles in Wikipedia. In an analogous fashion, [Kinsella 
et al. 2011] used geographic coordinates extracted from geotagged Twitter data 
to model user location. Locations were modeled at different levels of granularity 
ranging from the zip code to the country level. Experimental studies in [Kinsella 
et al. 2011] show that the proposed model can predict country, state and city with 
an accuracy comparable with that achieved by industrial tools for geo-localization. 

Crandall et al. [Crandall et al. 2009] investigated how to organize a large collec- 
tion of geotagged photos extracted from Flickr. The proposed approach combined 
content analysis (based on the textual tags typed by users to describe photos), 
image analysis and structural analysis based on geotagged data. The most rele- 
vant result is that it is possible to locate Flickr photos with a high precision by 
identifying landmarks via visual, temporal and textual features. 

In [Chaabane et al. 2012] the authors consider a range of user signals expressing 
user interests (e.g., "I like" declarations in Facebook) which are often disclosed by 
users. By means of a semantic-driven inference technique based on an ontologized 
version of Wikipedia, the authors show how to discover hidden information about 
users like gender, relationship status and age. 

From the discussion above it emerges that collecting seemingly innocuous data 
from a Social Web platform carries high privacy risks. Therefore, collected data 
should be manipulated in such a way as to reduce these risks. 

4.2.2 Extracting data from multiple Online Social Web platforms. Social Web 
users often create and maintain different profiles in different platforms with different 
goals (e.g., to post and share their photos on Flickr, to share bookmarks on Delicious, 
to keep track of job offers on LinkedIn and so on). 

We first discuss technical approaches to collecting data from multiple Social Web 
platforms as well as the opportunities coming from the availability of these data. 
We finally describe potential privacy risks related to the management of gathered 
data. 

Collecting data from multiple Social Web platforms. The main technical 
challenge encountered by Web Data Extraction techniques in collecting data from 
multiple Social Web platforms consists in linking information referring to the same user 
or the same object. Early approaches were based on "ad hoc" techniques; 
subsequent approaches featured a higher level of automation and were based on ad hoc 
APIs like the Google Social Graph API. 

To the best of our knowledge, one of the first approaches facing the problem 
of correlating multiple user accounts was presented in [Szomszor et al. 2008]. In 




42 • Emilio Ferrara et al. 

that paper, the authors started with a list of 667,141 user accounts on Delicious 
such that each account was uniquely associated with a Delicious profile. The same 
procedure was repeated on a data sample extracted from Flickr. The first stage in 
the correlation process consisted in comparing usernames in Delicious and Flickr: 
if these strings matched exactly, the two accounts were considered as referring to the 
same person. In this way, it was possible to build a candidate list consisting of 
232,391 usernames such that each username referred to a Flickr and a Delicious profile. 
Of course, such a list must be refined because, for instance, different users may 
choose the same username. Since both in Delicious and Flickr the users had the 
chance of filling in a form specifying their real names, the authors of [Szomszor 
et al. 2008] suggested refining the candidate list by keeping only those user accounts 
whose real names matched exactly. Such a procedure significantly lowered the risk 
of false associations but, at the same time, dramatically reduced the size 
of the candidate list to only 502 elements. Some tricks aimed at producing a larger 
and more accurate dataset were also proposed in [Szomszor et al. 2008]: in particular, 
the authors observed that, in real scenarios, if a user creates accounts in several 
Web sites, she frequently adds links to her accounts: so, by using traditional search 
engines like Google, we can find all pages linking to the homepage of a user and, 
by filtering these hits, we can find the exact URL of the profile of a user in different 
platforms. 
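
The two-stage matching described above (exact username match, then exact real name match) can be sketched in a few lines. The data layout, a mapping from username to declared real name, and all account data below are invented for this example and are not taken from [Szomszor et al. 2008].

```python
def correlate_accounts(site_a, site_b):
    """Each argument maps a username to a declared real name (or None).

    Stage 1 keeps usernames present on both sites (candidate list);
    stage 2 keeps only candidates whose real names match exactly.
    """
    candidates = site_a.keys() & site_b.keys()
    confirmed = {
        name for name in candidates
        if site_a[name] and site_a[name] == site_b[name]
    }
    return candidates, confirmed

# Invented accounts standing in for two platforms.
delicious = {"mrossi": "Mario Rossi", "jsmith": "John Smith", "anna": None}
flickr = {"mrossi": "Mario Rossi", "jsmith": "Jane Smith",
          "anna": "Anna Bianchi", "zed": "Zed Zero"}

candidates, confirmed = correlate_accounts(delicious, flickr)
print(sorted(candidates))  # same username on both sites
print(sorted(confirmed))   # same username and same real name
```

As in the study discussed above, the second stage trades recall for precision: accounts with a missing or differently spelled real name are discarded even when they belong to the same person.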

The process described above can be automated by exploiting ad hoc tools. 
Among these tools, the most popular was perhaps the Google Social Graph API, 
although this API is no longer available. Such an API is able to find connections 
among persons on the Web. It can be queried through an HTTP request having a 
URL called node as its parameter. The node specifies the URL of a Web page of a 
user u. 

The Google Social Graph API is able to return two kinds of results: 

— A list of public URLs that are associated with u; for instance, it reveals the URLs 
  of the blog of u and of her Twitter page. 
— A list of publicly declared connections among users. For instance, it returns the 
  list of persons who, in at least one social network, have a link to a page which 
  can be associated with u. 

The Google Social Graph API has been used in [Abel et al. 2012]. 

The identification of the connections between persons or Web objects (like photos 
or videos) is the key step in designing advanced services, often capable of offering a high 
level of personalization. 

For instance, in [Shi et al. 2011], the authors suggest using tags present in 
different Social Web systems to establish links between items located in each system. 
In [Abel et al. 2012] the system Mypes is presented. Mypes supports the linkage, 
aggregation, alignment and semantic enrichment of user profiles available in various 
Social Web systems, such as Flickr, Delicious and Facebook. In the field of 
Recommender Systems, the approach of [De Meo et al. 2009] shows how to merge ratings 
provided by users in different Social Web platforms with the goal of computing 
reputation values which are subsequently used to generate recommendations. 

Some authors have also proposed combining data coming from different platforms 
but referring to the same object. A nice example has been provided in [Stewart 
et al. 2009]; in that paper the authors consider users of blogs concerning music 
and users of Last.fm, a popular folksonomy whose resources are musical tracks. 
The ultimate goal of [Stewart et al. 2009] is to enrich each Social Web system by 
re-using tags already exploited in other environments. This activity has a twofold 
effect: it first allows the automatic annotation of resources which were not originally 
labeled and, then, enriches user profiles in such a way that user similarities can be 
computed in a more precise way. 

Privacy risks related to the management of user data spread in multiple 
platforms. The discussion above shows that the combination and linkage of data spread 
across independent Social Web platforms provide clear advantages in terms of item 
description as well as the level of personalization that a system is able to provide to 
its subscribers. However, the other side of the coin is given by the privacy problems 
we may incur when we try to glue together data residing on different platforms. 

Some authors recently introduced the so-called user identification problem, i.e., 
they studied what information is useful to disclose links between the multiple ac- 
counts of a user in independent Social Web platforms. 

One of the first approaches to dealing with the user identification problem was 
proposed in [Vosecky et al. 2009]. In that paper the authors focused on Facebook 
and StudiVZ 21 and investigated which profile attributes can be used to identify 
users. 

In [Zafarani and Liu 2009], the authors studied 12 different Social Web systems 
(like Delicious, Flickr and YouTube) with the goal of finding a mapping involving 
the different user accounts. This mapping can be found by applying a traditional 
search engine. 

In [Iofciu et al. 2011], the authors suggest combining profile attributes (like 
usernames) with an analysis of the user-contributed tags to identify users. They suggest 
various strategies to compare the tag-based profiles of two users, and some of these 
strategies were able to achieve an accuracy of almost 80% in user identification. 
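
One common way to compare two tag-based profiles is the cosine similarity between their tag frequency vectors. The sketch below shows this measure only as an example of such a comparison strategy; it is not claimed to be one of the strategies evaluated in [Iofciu et al. 2011], and the tag lists are invented.

```python
from collections import Counter
from math import sqrt

def tag_profile(tags):
    """Tag-based profile: how many times a user applied each tag."""
    return Counter(tag.lower() for tag in tags)

def cosine_similarity(p, q):
    """Cosine of the angle between two tag frequency vectors."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Invented tag lists for three accounts on two platforms.
flickr_user = tag_profile(["Rome", "travel", "travel", "food"])
delicious_user = tag_profile(["travel", "rome", "recipes", "food"])
other_user = tag_profile(["python", "linux", "kernel"])

print(round(cosine_similarity(flickr_user, delicious_user), 3))
print(cosine_similarity(flickr_user, other_user))
```

Accounts whose profiles score above a chosen threshold would then be treated as candidates for belonging to the same user; the threshold itself has to be tuned on labeled data.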

[Perito et al. 2011] explored the possibility of linking user profiles only by looking 
at their usernames. Such an approach is based on the idea that the probability that 
two usernames are associated with the same person depends on the entropies of the 
strings representing usernames. 
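
The intuition can be illustrated by estimating how much information a username carries under a simple character model: a rare, high-information username is more likely to identify a single person than a common one. The sketch below uses a unigram character model with add-one smoothing, which is a deliberate simplification and not the model of [Perito et al. 2011]; the training usernames are invented.

```python
from collections import Counter
from math import log2

def char_model(usernames):
    """Unigram character probabilities estimated from sample usernames.

    Add-one smoothing is used, with a single extra slot reserved for
    characters never seen in the sample.
    """
    counts = Counter("".join(usernames))
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda ch: (counts[ch] + 1) / (total + vocab)

def surprisal(username, prob):
    """Information content of a username, in bits, under the model."""
    return -sum(log2(prob(ch)) for ch in username)

# Invented sample of usernames used to estimate character frequencies.
prob = char_model(["john", "anna", "johnny", "joanna", "nathan"])

for name in ["john", "x9_kq7zz"]:
    print(name, round(surprisal(name, prob), 1), "bits")
```

In this toy setting the second username carries far more bits than the first, so two accounts sharing it are much less likely to belong to different people.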

[Balduzzi et al. 2010] described a simple but effective attack to discover the 
multiple digital identities of a user. Such an attack depends on the fact that a 
user often subscribes to multiple Social Web platforms by means of a single e- 
mail address. An attacker can query Social Web platforms by issuing a list of 
e-mail addresses to each Social Web platform; once the profiles of a user have been 
identified in each platform the data contained in these profiles can be merged to 
obtain private personal information. The authors considered a list of about 10.4
million e-mail addresses and were able to automatically identify more
than 1.2 million user profiles registered on different platforms like Facebook and
XING. The most popular providers acknowledged the privacy problem raised in
[Balduzzi et al. 2010] and implemented ad hoc countermeasures. 

An interesting study is finally presented in [Goga et al. 2012]. The authors
showed that correlating seemingly innocuous attributes describing user activities
in multiple social networks allows attackers to track user behaviors and infer their
features. In particular, the analysis of [Goga et al. 2012] focused on three features
of online activity: the geo-location of users' posts, the timestamp of posts, and
the user's writing style (captured by proper language models).

4.3 Opportunities for cross-fertilization 

In this section we discuss the possibility of re-using Web Data Extraction techniques
originally developed for a given application domain in another domain. This
discussion is instrumental in highlighting techniques which can be applied across 
different application domains and techniques which require some additional infor- 
mation (which can be present in some application domains and missing in other 
ones). 

In the most general case, no assumption is made on the structure and content 
of a collection of Web pages which constitute the input of a Web Data Extraction 
tool. Each Web page can be regarded as a text document and, in such a configu- 
ration, approaches relying on Regular Expressions can be applied independently of 
the application domain we are considering. As mentioned before, regular expressions
can be regarded as a formal language for finding strings or patterns in
text according to some matching criteria. Approaches based on regular expressions
therefore may be suitable when we do not have, at our disposal, any information
about the structure of the Web pages. Of course, these approaches can be extended
in such a way as to take some structural elements into account (like HTML tags).
The usage of regular expressions may be disadvantageous if we need to collect large
collections of documents and assume that those documents deal with different topics.
In such a case, we need complex expressions for extracting data from documents
and this would require a high level of expertise.
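The two flavours of regular-expression extraction mentioned above (treating the
page as plain text versus taking HTML tags into account) can be sketched as
follows. The page fragment, the patterns and the variable names are hypothetical
and serve only as an illustration.

```python
import re

# Hypothetical page fragment; a real page would be fetched from a Web source.
page = """
<p>Contact: <a href="mailto:sales@example.com">sales@example.com</a></p>
<p>Price: EUR 1,299.00 (was EUR 1,499.00)</p>
"""

# Plain-text pattern: the markup is ignored entirely.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Pattern exploiting a structural cue (the <a href="mailto:..."> tag).
MAILTO = re.compile(r'<a\s+href="mailto:([^"]+)"', re.IGNORECASE)

# Domain-specific pattern: amounts written as "EUR 1,299.00".
PRICE = re.compile(r"EUR\s+(\d{1,3}(?:,\d{3})*\.\d{2})")

emails = sorted(set(EMAIL.findall(page)))
mailto_targets = MAILTO.findall(page)
prices = [float(p.replace(",", "")) for p in PRICE.findall(page)]

print(emails, mailto_targets, prices)
```

Each additional kind of datum or document layout requires a further hand-written
pattern, which is the maintenance cost referred to in the paragraph above.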

The next step consists of assuming that some information on the structure of 
the document is available. In such a case, wrappers are an effective solution and 
they are able to work fairly well in different application domains. In fact, the hi- 
erarchical structure induced by HTML tags associated with a page often provides 
useful information to the Web Data Extraction task. A powerful solution taking
advantage of the structure of HTML pages derives from the usage of XPath to
quickly locate an element (or multiple instances of the same element in the
document tree). Approaches based on XPath were first exploited in the context of
Enterprise applications and, later, were successfully re-used in the context
of Social Web applications. There are, however, a number of requirements an
application domain should satisfy in order to enable the usage of XPath. First of
all, the structure of Web pages should be fully known to the wrapper designer
(or to the procedure inducing the wrapper); such an assumption, of course, cannot
hold in some domains because data are regarded as proprietary and technical
details about the structure of the Web wrappers in that domain are not available.
In addition, we should assume a certain level of structural coherence among all the 
Web pages belonging to a Web source. Such an assumption is often true both in 
enterprise domain and in many Social Web systems: in fact, Web pages within an 
organization (e.g., a firm) or a Social Web platform often derive from the same 
template and, therefore, a certain form of structural regularity across all the pages 
emerges. By contrast, if we plan to manage pages with different structures (even if
referring to the same domain), we observe that even small changes in the structure
of two Web pages may have a devastating impact and would thus require entirely
rewriting the wrapper.

Fig. 11. Facebook Crawling by means of a visual extraction approach. The goal of the wrapper
is to extract the names of the friends of a user (highlighted in the figure by means of red frames).
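The sensitivity of XPath-based wrappers to structural changes can be illustrated
with a minimal sketch. It relies on Python's standard xml.etree.ElementTree
module, which supports only a subset of XPath and requires well-formed markup; the
two page versions and the path expressions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page generated from a template.
PAGE_V1 = """
<html><body>
  <div id="friends">
    <ul>
      <li class="friend"><a href="/u/1">Alice</a></li>
      <li class="friend"><a href="/u/2">Bob</a></li>
    </ul>
  </div>
</body></html>
"""

# Same data after a small template change: the list is wrapped in a <section>.
PAGE_V2 = PAGE_V1.replace("<ul>", "<section><ul>").replace("</ul>", "</ul></section>")

# A path that spells out every step from the root, and a more tolerant one.
ABSOLUTE = "./body/div[@id='friends']/ul/li[@class='friend']/a"
RELATIVE = ".//li[@class='friend']/a"


def extract(page, path):
    """Return the text of every element selected by `path` in `page`."""
    root = ET.fromstring(page.strip())
    return [a.text for a in root.findall(path)]


for label, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(label, extract(page, ABSOLUTE), extract(page, RELATIVE))
```

The fully specified path stops matching as soon as one intermediate element is
added, whereas the descendant-based path survives this particular change; neither
is robust to arbitrary restructuring.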

A nice example of cross-fertilization is in the context of the crawling of Online 
Social Networks [Catanese et al. 2011; Catanese et al. 2010]. In those papers the
authors implemented a Web wrapper to crawl Facebook largely exploiting tech- 
niques and algorithms which were part of the Lixto suite and that were originally 
designed to work for Business and Competitive Intelligence applications. The pro- 
posed wrapper is based on the execution of XPath queries; human experts, in the 
configuration phase, are allowed to specify what elements have to be selected. 

The crawler provides two running modes: (i) visual extraction and (ii) HTTP
request-based extraction. In the visual extraction mode, depicted in Figure 11, the
crawler embeds a Firefox browser interfaced through XPCOM 22 and XULRunner 23.
The visual approach requires the rendering of the Web pages, which is a time-consuming
activity. Therefore, to extract large amounts of data, [Catanese et al.
2010] suggested sending HTTP requests to fetch Web pages.

In some application scenarios, in addition to assuming that the syntactic structure
of a Web page is known, it is possible to assume that a rich semantic structure
emerges from the Web pages. If such a hypothesis holds true, techniques from
Information Extraction (IE) and Natural Language Processing (NLP) can be conveniently
used [Winograd 1972; Berger et al. 1996; Manning and Schütze 1999]. The
range of applications benefiting from NLP techniques comprises relevant examples
both in the enterprise and Social Web scenarios: for instance, relevant applications
of NLP/IE techniques are the extraction of facts from speech transcriptions
in forums, email messages, newspaper articles, resumes etc.

5. CONCLUSIONS AND ONGOING WORK

The World Wide Web contains a large amount of unstructured data. The need for 
structured information urged researchers to develop and implement various strate- 
gies to accomplish the task of automatically extracting data from Web sources. 
Such a process is known as Web Data Extraction and it has had
(and continues to have) a wide range of applications in several fields, ranging from
commercial to Social Web applications.

The central thread of this survey is to classify existing approaches to Web Data 
Extraction in terms of the applications for which they have been employed. 

In the first part of this paper, we provided a classification of algorithmic techniques
exploited to extract data from Web pages. We organized the material by
presenting first basic techniques and, subsequently, the main variants of these techniques.
Finally, we focused on how Web Data Extraction systems work in practice.
We provided different perspectives to classify Web Data Extraction systems (like the
ease of use, the possibility of extracting data from the Deep Web and so on).

The second part of the survey is about the applications of Web Data Extraction 
systems to real-world scenarios. We provided a simple classification framework in 
which existing applications have been grouped into two main classes (Enterprise and 
Social Web applications). The material has been organized around the application
domains of Web Data Extraction systems: we identified, for each class, some sub-domains
and described how Web Data Extraction techniques work in each domain.
This part ends with a discussion about the opportunities of cross-fertilization. 

5.1 A glance on the future 

In this section we discuss some recent applications of Web Data Extraction techniques,
whose importance will certainly increase in the future. Special emphasis
is given to the application of Web Data Extraction techniques in the fields of
Bio-informatics and Scientific Computing, to Web Harvesting and, finally, to applications
devoted to linking data sources on the Web to improve the effectiveness of other
applications; each topic is discussed in a separate paragraph of this section.

Bio-informatics and Scientific Computing. A growing field of application of
Web Data Extraction is bio-informatics: on the World Wide Web it is very common
to find medical sources, in particular regarding bio-chemistry and genetics.
Bio-informatics is an excellent example of the application of scientific computing - refer
e.g. to [Descher et al. 2009] for a selected scientific computing project.

Plake et al. [Plake et al. 2006] worked on PubMed 24 - the biggest repository of
medical-scientific works that covers a broad range of topics - extracting information
and relationships to create a graph; this structure could be a good starting point
for extracting data about proteins and genes, for example connections
and interactions among them. This information is usually found not in Web
pages but in documents available in PDF or PostScript format. In the future, Web
Data Extraction should be extensively applied to these documents as well: approaches
to solving this problem are going to be developed by inheriting from both Information
Extraction and Web Data Extraction systems, because of the semi-structured format
of PostScript-based files. On the other hand, Web services play a dominant role
in this area as well, and another important challenge is the intelligent and efficient
querying of Web services as investigated by the ongoing SeCo project 25.

Web harvesting. One of the most attractive future applications of Web Data
Extraction is Web Harvesting [Weikum 2009]: Gatterbauer [Gatterbauer 2009] defines
it as "the process of gathering and integrating data from various heterogeneous
Web sources". The most important aspect (although partially different from specific
Web Data Extraction) is that, during the last phase of data transformation,
the amount of gathered data is many times greater than the extracted data. The
work of filtering and refining information from Web sources ensures that extracted
data lie in the domain of interest and are relevant to users: this step is called
integration. Web harvesting remains an open problem with a large margin for
improvement: because of the billions of Web pages, it is a computational problem,
even for restricted domains, to crawl enough sources from the Web to build a solid
ontological base. There is also a human engagement problem, correlated to the degree
of automation of the process: when and where do humans have to interact with
the Web harvesting system? Should it be a fully automatic process? What degree
of precision can we accept for the harvesting? All these questions are still open for
future work. Projects such as DIADEM 26 at Oxford University tackle the
challenge of fully automatic generation of wrappers for restricted domains such as
real estate.

Linking Data from several Web sources. A growing number of authors suggest
integrating data coming from as many sources as possible to obtain a detailed
description of an object. Such a description would be hard to obtain if we
focused only on a single system/service.

The material discussed in this section partially overlaps with some ideas/techniques
presented in Section 4.2.2. However, in Section 4.2.2 we focused on Social Web platforms
and showed that linking information stored in the profiles of a user spread
across multiple platforms leads to a better identification of user needs and, ultimately,
raises the level of personalization a system can offer to that user. In this section,
by contrast, we focus on systems having two main features: (i) they often publicly
expose their data on the Web and (ii) they lack a social connotation,
i.e., they do not aim at building a community of interacting members.

A first research work showing the benefits of linking data provided by independent
Web systems is provided in [Szomszor et al. 2007]. In that paper, the authors
combined information from the Internet Movie Database (www.imdb.com) and Netflix
(www.netflix.com). The IMDB is an online database containing extensive
information on movies, actors and television shows. IMDB users are allowed to add
tags to describe the main features of a movie (e.g., the most important scenes, location,
genres and so on). Netflix is a US-based company offering an online DVD
rental service. Netflix users are allowed to rate a movie by providing a score. In
[Szomszor et al. 2007], data from Netflix and IMDB were imported in a relational
DBMS; movie titles in IMDB were correlated with movie titles in Netflix by applying
string matching techniques. In this way, for each movie, the Netflix ratings
and the IMDB description were available. The authors studied three recommendation
strategies based on ratings alone, on tags alone and, finally, on a combination
of ratings and tags. Experimental trials provided evidence that the combination
of data located in different systems was able to improve the level of accuracy of
the provided recommendations. Spiegel et al. presented an analogous study in
[Spiegel et al. 2009]; in that paper the authors combined user ratings (coming from
the MovieLens database) with movie information provided by IMDB.
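Correlating titles across two catalogues by string matching can be sketched as
follows. The sketch uses the SequenceMatcher class of Python's standard difflib
module; the normalization rule, the 0.9 threshold and the sample titles are
assumptions made for illustration and are not taken from [Szomszor et al. 2007].

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, move a trailing ', The' to the front, and drop punctuation."""
    t = title.strip().lower()
    if t.endswith(", the"):
        t = "the " + t[:-5]
    return "".join(ch for ch in t if ch.isalnum() or ch.isspace()).strip()


def link_titles(left, right, threshold=0.9):
    """Map each title in `left` to its most similar title in `right`, if any."""
    links = {}
    for a in left:
        best, best_score = None, 0.0
        for b in right:
            score = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
            if score > best_score:
                best, best_score = b, score
        if best_score >= threshold:
            links[a] = best
    return links


# Hypothetical catalogue excerpts.
imdb_titles = ["The Matrix", "Blade Runner", "Amélie"]
netflix_titles = ["Matrix, The", "Blade Runner", "Alien"]

links = link_titles(imdb_titles, netflix_titles)
print(links)
```

The pairwise comparison is quadratic in the number of titles; on catalogues of
realistic size some form of blocking (for instance, comparing only titles that
share a release year) would be needed.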

The linkage of datasets coming from independent Web platforms fuels novel scientific
applications. One of these applications is given by cross-domain recommender
systems, i.e., recommender systems running on multiple domains. For example,
one can use user ratings about movies to predict user preferences in the music
domain. In a given domain (e.g., if our main aim is to recommend movies), information
about user preferences may be insufficient and, in many cases, most of this
information is missing for a non-negligible part of the user population. By contrast, a
relevant amount of information describing user preferences could be available (at
a limited cost) in other domains. We could transfer this information from one domain
to another with the goal of dealing with data sparsity and improving
recommendation accuracy.

Various strategies have been proposed to perform such a transfer of information
and some of these strategies take advantage of models and algorithms developed
in the field of transfer learning [Pan and Yang 2010]. For instance, some
approaches use co-clustering [Li et al. 2009a], others use clustering techniques in
conjunction with probabilistic models [Li et al. 2009b] and, finally, other approaches
project the user and item space in the various domains in a shared latent space by
means of regularization techniques [Zhang et al. 2010].

REFERENCES 

Abel, F., Herder, E., Houben, G.-J., Henze, N., and Krause, D. 2012. Cross-system User 
Modeling and Personalization on the Social Web. User Modeling and User-Adapted Interaction 
(UMUAI), To Appear. 

Amalfitano, D., Fasolino, A. R., and Tramontana, P. 2008. Reverse engineering finite state
machines from rich internet applications. In WCRE '08: Proc. of the 2008 15th Working
Conference on Reverse Engineering. IEEE Computer Society, Washington, DC, USA, 69-73.

Backstrom, L., Boldi, P., Rosa, M., Ugander, J., and Vigna, S. 2011. Four degrees of
separation. Arxiv preprint arXiv:1111.4570.

Balduzzi, M., Platzer, C., Holz, T., Kirda, E., Balzarotti, D., and Kruegel, C. 2010.
Abusing social networks for automated user profiling. In Proc. of the International Symposium
on Recent Advances in Intrusion Detection (RAID 2010). Lecture Notes in Computer Science.
Springer, Ottawa, Ontario, Canada, 422-441.

Baumgartner, R., Campi, A., Gottlob, G., and Herzog, M. 2010. Web data extraction for
service creation. Search Computing: Challenges and Directions. 

Baumgartner, R., Ceresna, M., and Ledermuller, G. 2005. Deepweb navigation in web data 
extraction. In CIMCA '05: Proc. of the International Conference on Computational Intel- 
ligence for Modelling, Control and Automation and International Conference on Intelligent 
Agents, Web Technologies and Internet Commerce Vol-2 (CIMCA-IAWTIC'06). IEEE Com- 
puter Society, Washington, DC, USA, 698-703. 

Baumgartner, R., Flesca, S., and Gottlob, G. 2001a. The elog web extraction language.
In LPAR '01: Proc. of the Artificial Intelligence on Logic for Programming. Springer-Verlag,
London, UK, 548-560.

ACM Computing Surveys, Vol. V, No. N, July 2012. 



Web Data Extraction, Applications and Techniques: A Survey • 49 



Baumgartner, R., Flesca, S., and Gottlob, G. 2001b. Visual web information extraction with
lixto. In VLDB '01: Proc. of the 27th International Conference on Very Large Data Bases.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 119-128.

Baumgartner, R., Frolich, O., Gottlob, G., Harz, P., Herzog, M., Lehmann, P., and Wien,
T. 2005. Web data extraction for business intelligence: the lixto approach. In Proc. of BTW
2005. 48-65.

Baumgartner, R., Froschl, K., Hronsky, M., Pottler, M., and Walchhofer, N. 2010.
Semantic online tourism market monitoring. Proc. of the 17th ENTER eTourism International
Conference.

Baumgartner, R., Gatterbauer, W., and Gottlob, G. 2009. Web data extraction system.
Encyclopedia of Database Systems, 3465-3471.

Baumgartner, R., Gottlob, G., and Herzog, M. 2009. Scalable web data extraction for online
market intelligence. Proc. of the International Conference on Very Large Databases 2, 2,
1512-1523.

Berger, A. L., Pietra, V. J. D., and Pietra, S. A. D. 1996. A maximum entropy approach to
natural language processing. Comput. Linguist. 22, 1, 39-71.

Berthold, M. and Hand, D. J. 1999. Intelligent Data Analysis: An Introduction. Springer-Verlag
New York, Inc., Secaucus, NJ, USA.

Bettencourt, L., Cintron-Arias, A., Kaiser, D., and Castillo-Chavez, C. 2006. The power
of a good idea: Quantitative modeling of the spread of ideas from epidemiological models.
Physica A: Statistical Mechanics and its Applications 364, 513-536.

Catanese, S., De Meo, P., Ferrara, E., and Fiumara, G. 2010. Analyzing the facebook
friendship graph. In Proc. of the 1st International Workshop on Mining the Future Internet.
14-19.

Catanese, S., De Meo, P., Ferrara, E., Fiumara, G., and Provetti, A. 2011. Crawling
facebook for social network analysis purposes. In Proc. of the International Conference on
Web Intelligence, Mining and Semantics (WIMS 2011). ACM, Sogndal, Norway, 52.

Chaabane, A., Acs, G., and Kaafar, M. 2012. You Are What You Like! Information Leakage
Through Users' Interests. In Proc. of the Annual Network and Distributed System Security
Symposium (NDSS 2012).

Chang, C.-H., Kayed, M., Girgis, M. R., and Shaalan, K. F. 2006. A survey of web information
extraction systems. IEEE Trans. on Knowl. and Data Eng. 18, 10, 1411-1428.

Chau, D., Pandit, S., Wang, S., and Faloutsos, C. 2007. Parallel crawling for online social
networks. In Proceedings of the 16th International Conference on the World Wide Web.
1283-1284.

Chen, F., Doan, A., Yang, J., and Ramakrishnan, R. 2008. Efficient information extraction over 
evolving text data. In Proc. of the IEEE 24th International Conference on Data Engineering 
(ICDE 2008). IEEE, Cancun, Mexico, 943-952. 

Chen, H., Chau, M., and Zeng, D. 2002. Ci spider: a tool for competitive intelligence on the 
web. Decis. Support Syst. 34, 1, 1-17. 

Chen, W. 2001. New algorithm for ordered tree-to-tree correction problem. Journal of Algo- 
rithms 40, 2, 135-158. 

Collins, M. 1996. A new statistical parser based on bigram lexical dependencies. In Proceed- 
ings of the 34th annual meeting on Association for Computational Linguistics. Association for 
Computational Linguistics, 184-191. 

Crandall, D., Backstrom, L., Huttenlocher, D., and Kleinberg, J. 2009. Mapping the 
world's photos. In Proc. of the International Conference on World Wide Web (WWW 2009). 
ACM, Madrid, Spain, 761-770. 

Crescenzi, V. and Mecca, G. 2004. Automatic information extraction from large websites. J. 
ACM 51, 5, 731-779. 

Crescenzi, V., Mecca, G., and Merialdo, P. 2001. Roadrunner: Towards automatic data 
extraction from large web sites. In VLDB '01: Proc. of the 27th International Conference on 
Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 109-118. 

Emilio Ferrara et al. 



Crescenzi, V., Mecca, G., and Merialdo, P. 2004. Improving the expressiveness of roadrunner. 
In SEBD. 62-69. 

Dave, K., Lawrence, S., and Pennock, D. M. 2003. Mining the peanut gallery: opinion extrac- 
tion and semantic classification of product reviews. In WWW '03: Proc. of the 12th interna- 
tional conference on World Wide Web. ACM, New York, NY, USA, 519-528. 

De Meo, P., Nocera, A., Quattrone, G., Rosaci, D., and Ursino, D. 2009. Finding re- 
liable users and social networks in a social internetworking system. In Proc. of the Interna- 
tional Database Engineering and Applications Symposium (IDEAS 2009). ACM Press, Cetraro, 
Cosenza, Italy, 173-181. 

Descher, M., Feilhauer, T., Ludescher, T., Masser, P., Wenzel, B., Brezany, P., Elsayed,
I., Wohrer, A., Tjoa, A. M., and Huemer, D. 2009. Position paper: Secure infrastructure
for scientific data life cycle management. In ARES. 606-611.

Fayzrakhmanov, R., Goebel, M., Holzinger, W., Kruepl, B., Mager, A., and Baumgartner,
R. 2010. Modelling web navigation with the user in mind. In Proc. of the 7th International
Cross-Disciplinary Conference on Web Accessibility.

Ferrara, E. and Baumgartner, R. 2011a. Automatic wrapper adaptation by tree edit distance
matching. Combinations of Intelligent Methods and Applications, 41-54.

Ferrara, E. and Baumgartner, R. 2011b. Design of automatically adaptable web wrappers. In
Proceedings of the 3rd International Conference on Agents and Artificial Intelligence. 211-217.

Ferrara, E. and Baumgartner, R. 2011c. Intelligent self-repairable web wrappers. Lecture
Notes in Computer Science 6934, 274-285.

Fiumara, G. 2007. Automated information extraction from web sources: a survey. Proc. of
Between Ontologies and Folksonomies Workshop in 3rd International Conference on Communities
and Technology, 1-9.

Flesca, S., Manco, G., Masciari, E., Rende, E., and Tagarelli, A. 2004. Web wrapper
induction: a brief survey. AI Commun. 17, 2, 57-61.

Furche, T., Gottlob, G., Grasso, G., Gunes, O., Guo, X., Kravchenko, A., Orsi, G.,
Schallhart, C., Sellers, A. J., and Wang, C. 2012. Diadem: domain-centric, intelligent,
automated data extraction methodology. In WWW (Companion Volume). 267-270.

Furche, T., Gottlob, G., Grasso, G., Schallhart, C., and Sellers, A. J. 2011. Oxpath: A
language for scalable, memory-efficient data extraction from web applications. PVLDB 4, 11,
1016-1027.

Garrett, J. J. 2005. Ajax: A new approach to web applications. Tech. rep., Adaptive Path.

Gatterbauer, W. 2009. Web harvesting. Encyclopedia of Database Systems, 3472-3473.

Gatterbauer, W. and Bohunsky, P. 2006. Table extraction using spatial reasoning on the css2
visual box model. In AAAI '06 Proc. of the 21st national conference on Artificial intelligence.
AAAI Press, 1313-1318.

Gjoka, M., Kurant, M., Butts, C., and Markopoulou, A. 2010. Walking in Facebook: a
case study of unbiased sampling of OSNs. In Proc. of the 29th Conference on Information
Communications. IEEE, 2498-2506.

Goga, O., Lei, H., Parthasarathi, S., Friedland, G., Sommer, R., and Teixeira, R. 2012.
On exploiting innocuous user activity for correlating accounts across social network sites. Tech.
rep., ICSI Technical Reports - University of Berkeley.

Gottlob, G. and Koch, C. 2004a. Logic-based web information extraction. SIGMOD Rec. 33, 2,
87-94.

Gottlob, G. and Koch, C. 2004b. Monadic datalog and the expressive power of languages for
web information extraction. J. ACM 51, 1, 74-113.

Hammersley, B. 2005. Developing feeds with rss and atom. O'Reilly.

Han, J. and Kamber, M. 2000. Data mining: concepts and techniques. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA.

Hecht, B., Hong, L., Suh, B., and Chi, E. 2011. Tweets from Justin Bieber's heart: the dynamics 
of the location field in user profiles. In Proc. of the International Conference on Human Factors 
in Computing Systems (CHI 2011). ACM, Vancouver, British Columbia, Canada, 237-246. 

Hsu, C.-N. and Dung, M.-T. 1998. Generating finite-state transducers for semi-structured data
extraction from the web. Inf. Syst. 23, 9, 521-538. 

Huck, G., Fankhauser, P., Aberer, K., and Neuhold, E. 1998. JEDI: Extracting and synthe- 
sizing information from the web. In Proc. of COOPIS. 

Iofciu, T., Fankhauser, P., Abel, F., and Bischoff, K. 2011. Identifying Users Across Social 
Tagging Systems. In Proc. of the International Conference on Weblogs and Social Media 
(ICWSM '11). Barcelona, Spain. 

Irmak, U. and Suel, T. 2006. Interactive wrapper generation with minimal user effort. In
Proc. of the International Conference on World Wide Web (WWW 2006). ACM, Edinburgh,
Scotland, 553-563.

Kaiser, K. and Miksch, S. 2005. Information extraction, a survey. Tech. rep., E188 - Institut
für Softwaretechnik und Interaktive Systeme; Technische Universität Wien.

Khare, R. and Celik, T. 2006. Microformats: a pragmatic path to the semantic web. In WWW 
'06: Proc. of the 15th international conference on World Wide Web. ACM, New York, NY, 
USA, 865-866. 

Kim, Y., Park, J., Kim, T., and Choi, J. 2007. Web information extraction by html tree 
edit distance matching. In International Conference on Convergence Information Technology. 
IEEE, 2455-2460. 

Kinsella, S., Murdock, V., and O'Hare, N. 2011. I'm eating a sandwich in Glasgow: modeling
locations with tweets. In Proc. of the International Workshop on Search and Mining user- 
generated contents. ACM, Glasgow, UK, 61-68. 

Kleinberg, J. 2000. The small-world phenomenon: an algorithm perspective. In Proc. of the 
ACM symposium on Theory of Computing (STOC 2000). ACM, Portland, Oregon, USA, 163- 
170. 

Kleinberg, J. 2008. The convergence of social and technological networks. Communications of 
the ACM 51, 11, 66-72. 

Kohlschutter, C., Fankhauser, P., and Nejdl, W. 2010. Boilerplate detection using shallow
text features. In Proceedings of the third ACM international conference on Web search and
data mining. ACM, 441-450.

Krupl, B., Herzog, M., and Gatterbauer, W. 2005. Using visual cues for extraction of tabular
data from arbitrary html documents. In WWW '05: Special interest tracks and posters of the
14th international conference on World Wide Web. ACM, New York, NY, USA, 1000-1001.

Krupl-Sypien, B., Fayzrakhmanov, R. R., Holzinger, W., Panzenbock, M., and Baumgart- 
ner, R. 2011. A versatile model for web page representation, information extraction and content 
re-packaging. In ACM Symposium on Document Engineering. 129-138. 

Kurant, M., Markopoulou, A., and Thiran, P. 2010. On the bias of breadth first search (bfs)
and of other graph sampling techniques. In Proceedings of the 22nd International Teletraffic
Congress. 1-8.

Kushmerick, N. 1997. Wrapper induction for information extraction. Ph.D. thesis, University
of Washington. Chairperson-Weld, Daniel S.

Kushmerick, N. 2000. Wrapper induction: efficiency and expressiveness. Artif. Intell. 118, 1-2,
15-68.

Kushmerick, N. 2002. Finite-state approaches to web information extraction. Proc. of 3rd
Summer Convention on Information Extraction, 77-91.

Kwak, H., Lee, C., Park, H., and Moon, S. 2010. What is Twitter, a social network or a news
media? In Proc. of the International Conference on World Wide Web (WWW 2010). ACM, 
Raleigh, North Carolina, USA, 591-600. 

Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., and Teixeira, J. S. 2002. A brief 
survey of web data extraction tools. SIGMOD Rec. 31, 2, 84-93. 

Li, B., Yang, Q., and Xue, X. 2009a. Can movies and books collaborate? cross-domain col- 
laborative filtering for sparsity reduction. In Proc. of the International Joint Conference on 
Artificial Intelligence (IJCAI 2009). Morgan Kaufmann Publishers Inc., Innsbruck, Austria, 
2052-2057. 

Li, B., Yang, Q., and Xue, X. 2009b. Transfer learning for collaborative filtering via a rating- 
matrix generative model. In Proc. of the International Conference on Machine Learning (ICML 
2009). Montreal, Canada, 617-624. 

Liu, B. 2011. Structured data extraction: Wrapper generation. Web Data Mining, 363-423. 

Liu, L., Pu, C., and Han, W. 2000. XWrap: An extensible wrapper construction system for
internet information. In Proc. of ICDE.

Manning, C. D. and Schütze, H. 1999. Foundations of statistical natural language processing.
MIT Press, Cambridge, MA, USA.

Masanes, J. 2006. Web archiving: issues and methods. Web Archiving, 1-53.

Mathes, A. 2004. Folksonomies - cooperative classification and communication through shared
metadata. Computer Mediated Communication 4-7, 10.

May, W., Himmeroder, R., Lausen, G., and Ludascher, B. 1999. A unified framework for
wrapping, mediating and restructuring information from the web. In WWWCM. Sprg. LNCS
1727.

Melomed, E., Gorbach, I., Berger, A., and Bateman, P. 2006. Microsoft SQL Server 2005 
Analysis Services (SQL Server Series). Sams, Indianapolis, IN, USA. 

Meng, X., Hu, D., and Li, C. 2003. Schema-guided wrapper maintenance for web-data extraction.
In WIDM '03: Proc. of the 5th ACM international workshop on Web information and data 
management. ACM, New York, NY, USA, 1-8. 

Mislove, A., Marcon, M., Gummadi, K., Druschel, P., and Bhattacharjee, B. 2007. Mea- 
surement and analysis of online social networks. In Proceedings of the 7th SIGCOMM. ACM, 
29-42. 

Monge, A. E. 2000. Matching algorithm within a duplicate detection system. IEEE Techn.
Bulletin Data Engineering 23, 4.

Muslea, I., Minton, S., and Knoblock, C. 1999. A hierarchical approach to wrapper induction.
In AGENTS '99: Proc. of the third annual conference on Autonomous Agents. ACM, New York,
NY, USA, 190-197.

Newman, M. 2003. The structure and function of complex networks. SIAM review, 167-256. 

ORO, E., Ruffolo, M., and Staab, S. 2010. Sxpath - extending xpath towards spatial querying 
on web documents. PVLDB 4, 2, 129-140. 

Pan, S. and Yang, Q. 2010. A survey on transfer learning. Knowledge and Data Engineering, 
IEEE Transactions on 22, 10, 1345-1359. 

Perito, D., Castelluccia, C, Kaafar, M., and Manils, P. 2011. How unique and traceable 
are usernames? In Privacy Enhancing Technologies. Springer, 1-17. 

Phan, X., HORIGUCHI, S., AND Ho, T. 2005. Automated data extraction from the web with 
conditional models. Int. J. Bus. Intell. Data Min. 1, 2, 194-209. 

Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J., and Leser, U. 2006. Alibaba:
Pubmed as a graph. Bioinformatics 22, 19, 2444-2445. 

Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. 
The VLDB Journal 10, 4, 334-350. 

Rahm, E. and Do, H. H. 2000. Data cleaning: Problems and current approaches. IEEE Bulletin 
on Data Engineering 23, 4. 

Romero, D. and Kleinberg, J. 2010. The Directed Closure Process in Hybrid Social-Information
Networks, with an Analysis of Link Formation on Twitter. In Proceedings of the 4th
International Conference on Weblogs and Social Media.

Sahuguet, A. and Azavant, F. 1999. Building light-weight wrappers for legacy web data-sources
using w4f. In VLDB '99: Proc. of the 25th International Conference on Very Large Data Bases. 
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 738-741. 

Sarawagi, S. 2008. Information extraction. Found. Trends databases 1, 3, 261-377.

Schifanella, R., Barrat, A., Cattuto, C., Markines, B., and Menczer, F. 2010. Folks in
folksonomies: social link prediction from shared metadata. In Proc. of the third ACM
international conference on Web Search and Data Mining (WSDM 2010). ACM Press, New York,
USA, 271-280.

Selkow, S. 1977. The tree-to-tree editing problem. Information processing letters 6, 6, 184-186. 

Shi, Y., Larson, M., and Hanjalic, A. 2011. Tags as bridges between domains: Improving
recommendation with tag-induced cross-domain collaborative filtering. In Proc. of the
International Conference on User Modeling, Adaption and Personalization (UMAP 2011). Lecture
Notes in Computer Science. Springer, Girona, Spain, 305-316.

Spiegel, S., Kunegis, J., and Li, F. 2009. Hydra: a hybrid recommender system [cross-linked
rating and content information]. In Proc. of the ACM International Workshop on Complex 
Networks Meet Information & Knowledge Management (CNIKM 2009). Hong Kong, China, 
75-80. 

Stewart, A., Diaz-Aviles, E., Nejdl, W., Marinho, L. B., Nanopoulos, A., and
Schmidt-Thieme, L. 2009. Cross-tagging for personalized open social networking. In Proc. of the ACM
International Conference on Hypertext and Hypermedia (HT 2009). Torino, Italy, 271-278.

Szomszor, M., Cantador, I., and Alani, H. 2008. Correlating user profiles from multiple
folksonomies. In Proc. of the ACM International Conference on Hypertext and Hypermedia 
(HT'08). ACM, Pittsburgh, PA, USA, 33-42. 

Szomszor, M., Cattuto, C., Alani, H., O'Hara, K., Baldassarri, A., Loreto, V., and
Servedio, V. 2007. Folksonomies, the semantic web, and movie recommendation. In Proc. of the
European Semantic Web Conference (ESWC 2007). Innsbruck, Austria.

Turmo, J., Ageno, A., and Catala, N. 2006. Adaptive information extraction. ACM Comput. 
Surv. 38, 2, 4. 

Vosecky, J., Hong, D., and Shen, V. 2009. User identification across multiple social networks.
In Proc. of the International Conference on Networked Digital Technologies (NDT '09). IEEE,
Ostrava, Czech Republic, 360-365.

Wang, J., Shapiro, B., Shasha, D., Zhang, K., and Currey, K. 1998. An algorithm for finding
the largest approximately common substructures of two trees. Pattern Analysis and Machine
Intelligence, IEEE Transactions on 20, 8, 889-895.

Wang, P., Hawk, W. B., and Tenopir, C. 2000. Users' interaction with world wide web resources:
an exploratory study using a holistic approach. Inf. Process. Manage. 36, 229-251.

Weikum, G. 2009. Harvesting, searching, and ranking knowledge on the web: invited talk. In
WSDM '09: Proc. of the Second ACM International Conference on Web Search and Data
Mining. ACM, New York, NY, USA, 3-4.

Wilson, C., Boe, B., Sala, A., Puttaswamy, K., and Zhao, B. 2009. User interactions in social
networks and their implications. In Proceedings of the 4th European Conference on Computer
Systems. ACM, 205-218.

Winkler, W. 1999. The state of record linkage and current research problems. In Statistical
Research Division, US Census Bureau.

Winograd, T. 1972. Understanding Natural Language. Academic Press, Inc., Orlando, FL, USA.

Yang, W. 1991. Identifying syntactic differences between two programs. Software - Practice and
Experience 21, 7, 739-755.

Ye, S., Lang, J., and Wu, F. 2010. Crawling Online Social Graphs. In Proceedings of the 12th
International Asia-Pacific Web Conference. IEEE, 236-242.

Zafarani, R. and Liu, H. 2009. Connecting corresponding identities across communities. In
Proc. of the International Conference on Weblogs and Social Media (ICWSM '09). The AAAI
Press, San Jose, California, USA, 354-357.

Zanasi, A. 1998. Competitive intelligence through data mining public sources. In Competitive
intelligence review. Vol. 9. Wiley, New York, NY, ETATS-UNIS (1990-2001) (Revue), 44-54.

Zhai, Y. and Liu, B. 2005. Web data extraction based on partial tree alignment. In WWW '05:
Proc. of the 14th international conference on World Wide Web. ACM, NY, 76-85.

Zhai, Y. and Liu, B. 2006. Structured data extraction from the web based on partial tree
alignment. IEEE Trans. on Knowl. and Data Eng. 18, 12, 1614-1628.

Zhang, K., Statman, R., and Shasha, D. 1992. On the editing distance between unordered
labeled trees. Information processing letters 42, 3, 133-139.

Zhang, Y., Cao, B., and Yeung, D. 2010. Multi-domain collaborative filtering. In Proc. of the 
26th Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press, Catalina Island, 
California, USA, 725-732. 

Zhang, Z., Zhang, C., Lin, Z., and Xiao, B. 2010. Blog extraction with template-independent
wrapper. In Network Infrastructure and Digital Content, 2010 2nd IEEE International
Conference on. IEEE, 313-317.

Zhao, H. 2007. Automatic wrapper generation for the extraction of search result records from 
search engines. Ph.D. thesis, State University of New York at Binghamton, Binghamton, NY, 
USA. Adviser-Meng, Weiyi. 

Received: June 2010; revised: July 2012; 


